Energy forecasting is a vital part of today's energy markets and operations. It is used to set the cost of electricity, determine when power generation should be increased, and decide when power needs to be directed to or from other regions of the electrical grid. Electrical energy demand continues to rise, especially with the recent widespread adoption of electric vehicles and heat pumps. As a result, the electrical grid will need to become more robust and adaptive as this increase in demand comes in tandem with a transition away from fossil fuels and into a more diversified energy market and with it, more complex cycles. These circumstances make accurate forecasting and periodicity identification of electrical loads and generation that much more important, and it is necessary to research the differences in forecasting applications and relevance in various short-term, medium-term, and long-term settings.
Many of the statistical time series forecasting techniques of the past are still in use today, each with their own strengths and weaknesses, often trading compute efficiency with lesser accuracy in long-term forecast horizons or ability to capture multiple seasonal periods. More modern machine learning models and neural network-based architectures tend to generalize through forecasting horizons better, but often at great cost of compute resources making them inefficient for short forecast horizon forecasts. Though because of the potential for complexity of energy load patterns, both categories see benefits in task-specific tuning. By evaluating each forecasting model's accuracy across various context windows and forecasting horizons, this project presents a thorough comparison on which techniques are most relevant in the task-specific energy grid context and each forecast horizon.
Statistical, machine learning, and deep learning-based methods compared across 5 time series and 6 forecast horizon lengths.
Cody Hill
April 2024
Github Repo: https://github.com/chill0121/Energy_Grid_Forecasting
Electrical load and generation data was gathered from Pennsylvania-New Jersey-Maryland Interconnection, now called PJM, which is a regional transmission organization (RTO) that coordinates movement of electricity in 13 states and Washington, DC.
Data Info:
Hourly frequency.
Feature Information
Weather data was sourced from National Oceanic and Atmospheric Administration's (NOAA) Regional Climate Center (RCC) Applied Climate Information System (ACIS). The data was manually downloaded from https://scacis.rcc-acis.org/# by selecting a weather station and choosing relevant dates and features with the daily data listings form.
Data Info:
Daily frequency.
zone (selected in section 3.4.).Ideally, the sampling rate would match the hourly load data, however, only daily weather data is available at this time.
A weather station was chosen for each zone (selected in section 3.4.). The criteria considered for the weather station was:
Weather Stations Chosen:
AE: MILLVILLE MUNICIPAL AIRPORT, NJ CE: CHICAGO OHARE INTL AP, ILDOM: RICHMOND INTERNATIONAL AP, VAJC: NEW BRUNSWICK 3 SE, NJPEP: WASHINGTON REAGAN NATIONAL AIRPORT, VA
(Map Source: https://www.pjm.com/-/media/about-pjm/pjm-zones.ashx)
import os
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.dates import AutoDateLocator, AutoDateFormatter
from scipy.fft import fft
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import statsmodels.api as sm
from statsforecast.models import MSTL, AutoARIMA, HoltWinters
from statsforecast.utils import ConformalIntervals
#from prophet import Prophet
from sklearn.svm import SVR
import xgboost as xgb
from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS, TFT
from neuralforecast.losses.pytorch import DistributionLoss
import torch
Note:
Unable to resolve error that causes the kernel to crash when running N-HiTS when importing both statsforecast.models import MSTL, AutoARIMA, HoltWinters AND neuralforecast.models import NHITS, TFT.
In the N-HiTS section, the model can be trained, saved, and commented out. Then return the statsforecast models and use the load code block in the N-HiTS section to load predictions.
print(f"Python version: {sys.version}")
packages = [pd, np, sns, sm, xgb, torch]
for package in packages:
print(f"{str(package).partition('from')[0]} using version: {package.__version__}")
Python version: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)] <module 'pandas' using version: 2.1.4 <module 'numpy' using version: 1.26.4 <module 'seaborn' using version: 0.13.2 <module 'statsmodels.api' using version: 0.14.1 <module 'xgboost' using version: 2.0.3 <module 'torch' using version: 2.2.2
# Set directories
current_wdir = os.getcwd()
data_folder_hourly = current_wdir + '/Data/Hourly_Load'
data_folder_weather = current_wdir + '/Data/Weather'
# Add and sort all filenames from each folder path.
file_path_load = [f'{data_folder_hourly}/{file}' for file in os.listdir(data_folder_hourly) if '.csv' in file]
file_path_load = sorted(file_path_load)
file_path_weather = [f'{data_folder_weather}/{file}' for file in os.listdir(data_folder_weather) if '.csv' in file]
file_path_weather = sorted(file_path_weather)
# Iterate through filenames and add them to dataframe.
# pd.read_csv can unzip as it goes with compression argument.
load_df = pd.concat([pd.read_csv(file, compression = 'gzip') for file in file_path_load], join = 'outer', ignore_index = False, axis = 0)
weather_df_list = []
for filename in file_path_weather:
# delimiter has whitespace after comma, specified here as \s+.
temp_df = pd.read_csv(filename, compression = 'gzip', delimiter = ',\s+', engine = 'python')
temp_df['zone']= os.path.basename(filename).replace('.csv.gz', '')
weather_df_list.append(temp_df)
weather_df = pd.concat(weather_df_list, join = 'outer', ignore_index = False, axis = 0)
display(load_df)
display(load_df.dtypes)
| datetime_beginning_utc | datetime_beginning_ept | nerc_region | mkt_region | zone | load_area | mw | is_verified | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1/1/1993 5:00:00 AM | 1/1/1993 12:00:00 AM | PJM RTO | PJM | BC | BC | 2358.000 | True |
| 1 | 1/1/1993 5:00:00 AM | 1/1/1993 12:00:00 AM | PJM RTO | PJM | CNCT | AE | 855.000 | True |
| 2 | 1/1/1993 5:00:00 AM | 1/1/1993 12:00:00 AM | PJM RTO | PJM | CNCT | DPL | 1150.000 | True |
| 3 | 1/1/1993 5:00:00 AM | 1/1/1993 12:00:00 AM | PJM RTO | PJM | GPU | JC | 1632.000 | True |
| 4 | 1/1/1993 5:00:00 AM | 1/1/1993 12:00:00 AM | PJM RTO | PJM | GPU | ME | 929.000 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 62605 | 3/28/2024 3:00:00 AM | 3/27/2024 11:00:00 PM | RFC | MIDATL | RECO | RECO | 131.877 | False |
| 62606 | 3/28/2024 3:00:00 AM | 3/27/2024 11:00:00 PM | RFC | MIDATL | PEP | SMECO | 355.931 | False |
| 62607 | 3/28/2024 3:00:00 AM | 3/27/2024 11:00:00 PM | RFC | MIDATL | PL | UGI | 101.478 | True |
| 62608 | 3/28/2024 3:00:00 AM | 3/27/2024 11:00:00 PM | RFC | MIDATL | AE | VMEU | 70.051 | False |
| 62609 | 3/28/2024 3:00:00 AM | 3/27/2024 11:00:00 PM | RTO | RTO | RTO | RTO | 80180.923 | False |
5446243 rows × 8 columns
datetime_beginning_utc object datetime_beginning_ept object nerc_region object mkt_region object zone object load_area object mw float64 is_verified bool dtype: object
Some datatype reformatting will have to be done on the date.
display(weather_df)
display(weather_df.dtypes)
| Date | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | zone | |
|---|---|---|---|---|---|---|
| 0 | 2015-01-01 | 43 | 17 | 30.0 | 0.00 | AE |
| 1 | 2015-01-02 | 45 | 24 | 34.5 | 0.00 | AE |
| 2 | 2015-01-03 | 47 | 21 | 34.0 | 0.62 | AE |
| 3 | 2015-01-04 | 64 | 46 | 55.0 | 0.26 | AE |
| 4 | 2015-01-05 | 50 | 25 | 37.5 | 0.00 | AE |
| ... | ... | ... | ... | ... | ... | ... |
| 11417 | 2024-04-05 | 57 | 40 | 48.5 | 0.00 | PEP |
| 11418 | 2024-04-06 | 57 | 41 | 49.0 | 0.00 | PEP |
| 11419 | 2024-04-07 | 65 | 43 | 54.0 | 0.00 | PEP |
| 11420 | 2024-04-08 | 71 | 44 | 57.5 | 0.00 | PEP |
| 11421 | 2024-04-09 | 78 | 51 | 64.5 | 0.00 | PEP |
32400 rows × 6 columns
Date object MaxTemperature object MinTemperature object AvgTemperature object Precipitation object zone object dtype: object
All of these will potentially be useful as covariates in the models.
The sample rates between datasets are mismatched - electric load data is hourly and this weather data is daily. This will require either load resampling and aggregation or assigning this weather data to each hour as a static daily feature.
Convert the dates into datatime objects.
# Convert to datetime. The two dataframes share column names.
date_cols = ['datetime_beginning_utc', 'datetime_beginning_ept']
load_df[date_cols] = load_df[date_cols].apply(pd.to_datetime, format = '%m/%d/%Y %I:%M:%S %p', utc = False)
weather_df['Date'] = weather_df['Date'].apply(pd.to_datetime, format = '%Y-%m-%d', utc = False)
Since the zone column will likely be used to group the data by, we can start by grouping the data by zone and date and plot over time.
load_zone_df = load_df[['datetime_beginning_ept', 'zone', 'mw']].groupby(['datetime_beginning_ept', 'zone'], as_index = False).sum()
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(load_zone_df, x = 'datetime_beginning_ept', y = 'mw', hue = 'zone', alpha = 0.7, lw = 1)
Zone 'RTO' seems to be on a different scale and looks like an aggregate sum of all the zones. This is likely to be the case because 'RTO' stands for Regional Transmission Organization which is exactly where this data came from.
Let's confirm this by dropping the 'RTO' zone, summing the remaining zones and overlaying it on the previous plot.
load_noRTO_df = load_zone_df.drop(load_zone_df[load_zone_df['zone'] == 'RTO'].index)
ax, fig = plt.subplots(figsize = (14, 8))
# Black - Sum of ALL
ax = sns.lineplot(load_zone_df[['datetime_beginning_ept', 'mw']].groupby(['datetime_beginning_ept'], as_index = False).sum(), x = 'datetime_beginning_ept', y = 'mw', alpha = 0.7, lw = 1, c = 'black')
# Purple - Sum of All but NO RTO
ax = sns.lineplot(load_noRTO_df[['datetime_beginning_ept', 'mw']].groupby(['datetime_beginning_ept'], as_index = False).sum(), x = 'datetime_beginning_ept', y = 'mw', alpha = 0.7, lw = 1, c = 'purple')
ax = sns.lineplot(load_zone_df, x = 'datetime_beginning_ept', y = 'mw', hue = 'zone', alpha = 0.7, lw = 1)
Besides a few artifacts in the data, it looks like that was the case. We can get rid of RTO since forecasting an aggregated time series signal isn't the goal here.
Next, more information is needed about how many zones there are and where they are located.
sorted(load_df['zone'].unique())
['AE', 'AEP', 'AP', 'ATSI', 'BC', 'CE', 'CNCT', 'DAY', 'DEOK', 'DOM', 'DPL', 'DUQ', 'EKPC', 'GPU', 'JC', 'ME', 'OVEC', 'PE', 'PEP', 'PL', 'PN', 'PS', 'RECO', 'RTO']
Here we can manually look at these zones and research where they are located. This information is important to align the weather data. Most of them easily identified with a map from the data source (can be found in section 3.4).
Zones:
AE - New JerseyAEP - OhioAP - West VirginiaATSI - OhioBC - MarylandCE - Northern IllinoisCNCT - ConnecticutDAY - OhioDEOK - OhioDOM - VirginiaDPL - ConnecticutDUQ - PennsylvaniaEKPC - KentuckyGPU - New JerseyJC - New JerseyME - PennsylvaniaOVEC - OhioPE - PennsylvaniaPEP - MarylandPL - PennsylvaniaPN - PennsylvaniaPS - New JerseyRECO - New Jersey
RTO - Regional Transmission Organization (PJM) - Aggregated Sum
Forecasting ALL of these zones is beyond the scope and computational limitations of this project, because of this let's build out a function to visualize each zone and choose a few that are interesting that will be used to build the models and perform predictions on.
def load_zone_plot(zone_df, outliers=None, figsize=(16,16), cmap=None):
"""Finds all unique zone names and prints to output (n x 2) subplot array of line plots where x = datetime vs y = MW.
Can handle any length of list of strings.
Parameters:
zones_df (list): Pandas dataframe grouped by zone and time such as load_zone_df.
Optional Parameters:
outliers (dict): List of outlier indexes that will be plotted as points on top of line plot. Has the form of {'Zone' : [idx, idx, ...]}
figsize (tuple): Default = (16,16)
Returns:
None. Prints plot to output."""
zone_list = zone_df['zone'].unique()
color_pal = ['#005d5d', '#017d66']
fig, ax = plt.subplots(int(np.ceil(len(zone_list)/2)), 2, figsize = figsize, sharey = False, sharex = False)
for i, zone in enumerate(zone_list):
#print(f'{i} {zone} -- [{i//2}, {i%2}]')
# i iterates through axes rows and columns with // and % operators
# ([0,0] to [0,1] to [1,0] ... etc).
sns.lineplot(load_zone_df.loc[load_zone_df['zone'] == zone], ax = ax[i//2, i%2],
x = 'datetime_beginning_ept', y = 'mw',
alpha = 0.7, lw = 1,
# Alternates colors between rows.
c = cmap[zone] if cmap else (color_pal[int(i % 4 < 2)]))
ax[i//2, i%2].set_title(zone, y = 1)
ax[i//2, i%2].set_xlabel(None)
ax[i//2, i%2].set_ylabel(None)
# If given outlier indexes, plot them.
if outliers:
sns.scatterplot(load_zone_df.iloc[outliers.get(zone)], ax = ax[i//2, i%2],
x = 'datetime_beginning_ept', y = 'mw',
c = 'black')
plt.subplots_adjust(wspace = 0.3, hspace = 1)
fig.suptitle('Hourly Load of Each Zone')
fig.supxlabel('Datetime')
fig.supylabel('Megawatts (MW)')
fig.tight_layout()
# Check if odd number of plots and delete last subplot if true.
if len(zone_list) % 2 != 0:
fig.delaxes(ax[i // 2, (len(zone_list) % 2)])
return None
load_zone_plot(load_zone_df)
The following zones will be chosen for the remainder of this project.
AE: Atlantic City Electric Co. CE: ComEdDOM: DominionJC: Jersey Central Power & LightPEP: Potomac Electric Power Co.zones = ['AE', 'CE', 'DOM', 'JC', 'PEP']
load_zone_df = load_zone_df.loc[load_zone_df['zone'].isin(zones)]
load_zone_df.reset_index(drop = True, inplace = True)
# Color map zones for consistency.
palette = ['#570408', '#005d5d', '#1192e8', '#fa4d56', '#012749']
zones_palette = dict(zip(zones, palette))
Note about the weather data:
After choosing zones to work with, weather data was sourced from National Oceanic and Atmospheric Administration's (NOAA) Regional Climate Center (RCC) Applied Climate Information System (ACIS). Detailed information can be found in section 1. For continuity the weather data was imported above along with the other datasets after the zones were chosen.
Ideally, the sampling rate would match the hourly load data, however, only daily weather data is available at this time.
A weather station was chosen for each zone. The criteria considered for the weather station was:
Weather Stations Chosen:
AE: MILLVILLE MUNICIPAL AIRPORT, NJ CE: CHICAGO OHARE INTL AP, ILDOM: RICHMOND INTERNATIONAL AP, VAJC: NEW BRUNSWICK 3 SE, NJPEP: WASHINGTON REAGAN NATIONAL AIRPORT, VA
(Map Source: https://www.pjm.com/-/media/about-pjm/pjm-zones.ashx)
Some but not all of the models require the sample frequency to have no missing samples (in this case hours in load_df and days in weather_df - they will be joined later where the weather data will become hourly static features.).
Method:
# Could also use .asfreq('H') which would assign any missing rows to NaN then identify rows.
print('######### Check for Missing Weather Dates #########')
for zone in zones:
# Find min and max dates for that zone.
date_min = weather_df.loc[weather_df['zone'] == zone]['Date'].min()
date_max = weather_df.loc[weather_df['zone'] == zone]['Date'].max()
print(f'{zone} - {date_min} | {date_max}')
# Find missing dates.
# Weather DF - Set freq to 'D' for days.
miss_date = pd.date_range(start = date_min, end = date_max, freq = 'D').difference(weather_df['Date'])
print(f"{'':<5}Missing Dates: {miss_date if len(miss_date) > 0 else False}")
print('\n######### Check for Missing Energy Dates #########')
for zone in zones:
# Find min and max dates for that zone.
date_min = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].min()
date_max = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].max()
print(f'{zone} - {date_min} | {date_max}')
# Find missing dates.
# Load DF - Set freq to 'H' for hours.
miss_date = pd.date_range(start = date_min, end = date_max, freq = 'H').difference(
load_zone_df['datetime_beginning_ept'].loc[load_zone_df['zone'] == zone])
print(f"{'':<5}Missing Dates: {miss_date if len(miss_date) > 0 else False}")
# Check to see if missing dates exist.
# Fill missing dates with average of two surrounding times.
# TODO: There's definitely some optimizations to be had here.
if len(miss_date) > 0:
print(f"{'':<5}============================================================")
print(f"{'':<5}Adding Average of Surrounding Times to {zone} Missing Dates")
# Find index of one hour before and after missing
miss_pre_index = (load_zone_df.loc[(load_zone_df['zone'] == zone) &
(load_zone_df['datetime_beginning_ept'].isin(
miss_date - pd.Timedelta(hours = 1)))]).index
miss_post_index = (load_zone_df.loc[(load_zone_df['zone'] == zone) &
(load_zone_df['datetime_beginning_ept'].isin(
miss_date + pd.Timedelta(hours = 1)))]).index
# Calculate average of two surround times.
miss_avg = [np.round(np.mean((a,b)), 3) for a, b in
(zip(load_zone_df.iloc[miss_pre_index]['mw'], load_zone_df.iloc[miss_post_index]['mw']))]
# Create DF to hold values and concat to main DF.
temp_df = pd.DataFrame({'datetime_beginning_ept' : miss_date,
'zone' : zone,
'mw' : miss_avg})
load_zone_df = pd.concat([load_zone_df, temp_df], join = 'outer', ignore_index = False, axis = 0)
print(f"{'':<5}{'='*5} Done!\n")
######### Check for Missing Weather Dates #########
AE - 2015-01-01 00:00:00 | 2024-04-09 00:00:00
Missing Dates: False
CE - 2004-05-01 00:00:00 | 2024-04-09 00:00:00
Missing Dates: False
DOM - 2005-05-01 00:00:00 | 2024-04-09 00:00:00
Missing Dates: False
JC - 2015-01-01 00:00:00 | 2024-04-10 00:00:00
Missing Dates: False
PEP - 1993-01-01 00:00:00 | 2024-04-09 00:00:00
Missing Dates: False
######### Check for Missing Energy Dates #########
AE - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: DatetimeIndex(['2015-03-08 02:00:00', '2016-03-13 02:00:00',
'2017-03-12 02:00:00', '2018-03-11 02:00:00',
'2019-03-10 02:00:00', '2020-03-08 02:00:00',
'2021-03-14 02:00:00', '2022-03-13 02:00:00',
'2023-03-12 02:00:00', '2024-03-10 02:00:00'],
dtype='datetime64[ns]', freq=None)
============================================================
Adding Average of Surrounding Times to AE Missing Dates
===== Done!
CE - 2004-05-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: DatetimeIndex(['2005-04-03 02:00:00', '2006-04-02 02:00:00',
'2007-03-11 02:00:00', '2008-03-09 02:00:00',
'2009-03-08 02:00:00', '2010-03-14 02:00:00',
'2011-03-13 02:00:00', '2012-03-11 02:00:00',
'2013-03-10 02:00:00', '2014-03-09 02:00:00',
'2015-03-08 02:00:00', '2016-03-13 02:00:00',
'2017-03-12 02:00:00', '2018-03-11 02:00:00',
'2019-03-10 02:00:00', '2020-03-08 02:00:00',
'2021-03-14 02:00:00', '2022-03-13 02:00:00',
'2023-03-12 02:00:00', '2024-03-10 02:00:00'],
dtype='datetime64[ns]', freq=None)
============================================================
Adding Average of Surrounding Times to CE Missing Dates
===== Done!
DOM - 2005-05-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: DatetimeIndex(['2006-04-02 02:00:00', '2007-03-11 02:00:00',
'2008-03-09 02:00:00', '2009-03-08 02:00:00',
'2010-03-14 02:00:00', '2011-03-13 02:00:00',
'2012-03-11 02:00:00', '2013-03-10 02:00:00',
'2014-03-09 02:00:00', '2015-03-08 02:00:00',
'2016-03-13 02:00:00', '2017-03-12 02:00:00',
'2018-03-11 02:00:00', '2019-03-10 02:00:00',
'2020-03-08 02:00:00', '2021-03-14 02:00:00',
'2022-03-13 02:00:00', '2023-03-12 02:00:00',
'2024-03-10 02:00:00'],
dtype='datetime64[ns]', freq=None)
============================================================
Adding Average of Surrounding Times to DOM Missing Dates
===== Done!
JC - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: DatetimeIndex(['2015-03-08 02:00:00', '2016-03-13 02:00:00',
'2017-03-12 02:00:00', '2018-03-11 02:00:00',
'2019-03-10 02:00:00', '2020-03-08 02:00:00',
'2021-03-14 02:00:00', '2022-03-13 02:00:00',
'2023-03-12 02:00:00', '2024-03-10 02:00:00'],
dtype='datetime64[ns]', freq=None)
============================================================
Adding Average of Surrounding Times to JC Missing Dates
===== Done!
PEP - 1993-01-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: DatetimeIndex(['1993-04-04 02:00:00', '1994-04-03 02:00:00',
'1995-04-02 02:00:00', '1996-04-07 02:00:00',
'1997-04-06 02:00:00', '1998-04-05 02:00:00',
'1999-04-04 02:00:00', '2000-04-02 02:00:00',
'2001-04-01 02:00:00', '2002-04-07 02:00:00',
'2003-04-06 02:00:00', '2004-04-04 02:00:00',
'2005-04-03 02:00:00', '2006-04-02 02:00:00',
'2007-03-11 02:00:00', '2008-03-09 02:00:00',
'2009-03-08 02:00:00', '2010-03-14 02:00:00',
'2011-03-13 02:00:00', '2012-03-11 02:00:00',
'2013-03-10 02:00:00', '2014-03-09 02:00:00',
'2015-03-08 02:00:00', '2016-03-13 02:00:00',
'2017-03-12 02:00:00', '2018-03-11 02:00:00',
'2019-03-10 02:00:00', '2020-03-08 02:00:00',
'2021-03-14 02:00:00', '2022-03-13 02:00:00',
'2023-03-12 02:00:00', '2024-03-10 02:00:00'],
dtype='datetime64[ns]', freq=None)
============================================================
Adding Average of Surrounding Times to PEP Missing Dates
===== Done!
weather_df had no missing dates, so a solution to add date rows was not implemented.
Now to recheck load_df to make sure our data injection worked.
for zone in zones:
# Find min and max dates for that zone.
date_min = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].min()
date_max = load_zone_df.loc[load_zone_df['zone'] == zone]['datetime_beginning_ept'].max()
print(f'{zone} - {date_min} | {date_max}')
# Find missing dates.
# Load DF - Set freq to 'H' for hours.
miss_date = pd.date_range(start = date_min, end = date_max, freq = '1H').difference(
load_zone_df['datetime_beginning_ept'].loc[load_zone_df['zone'] == zone])
print(f"{'':<5}Missing Dates: {miss_date if len(miss_date) > 0 else False}")
# Sort by date and reset index.
load_zone_df = load_zone_df.sort_values(by = 'datetime_beginning_ept').reset_index(drop = True)
AE - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: False
CE - 2004-05-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: False
DOM - 2005-05-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: False
JC - 2015-01-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: False
PEP - 1993-01-01 00:00:00 | 2024-03-27 23:00:00
Missing Dates: False
Though the weather dataset was not missing any days, according to NOAA, the data logged may have flags for certain circumstances or measurement errors.
According to their glossary, weather station data may contain the following flags:
These values need to be located and handled in different ways.
# Find unique non-numeric labels and assign values depending on col
for col in weather_df.columns[1:5]:
search_str = set(weather_df[col].unique().astype(str))
unique_strings = {x for x in search_str if x.isalpha()}
print(f"{col} | Unique Strings: {unique_strings if len(unique_strings) > 0 else False}")
if col == 'Precipitation':
weather_df[col] = weather_df[col].mask(weather_df[col] == 'T', '0.01') # Trace
weather_df[col] = weather_df[col].mask(weather_df[col] == 'M', '0.00') # Missing
continue # Move on to next iteration.
# Can handle Min, Max, Avg columns the same.
# Take surrounding 6 values (by date) and average them to get missing value.
m_index = weather_df.loc[weather_df[col] == 'M']
# TODO: Perform and exorcism on this section....
# Not ideal, could arise that Max < Min, or Avg isn't calc from Min and Max, etc.
# AE is the only zone with M values, hard coded for now.
m_df = pd.DataFrame({
'3' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=3))) & (weather_df['zone'] == 'AE')].to_list(),
'2' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=2))) & (weather_df['zone'] == 'AE')].to_list(),
'1' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=1))) & (weather_df['zone'] == 'AE')].to_list(),
'0' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] + pd.Timedelta(days=0))) & (weather_df['zone'] == 'AE')].to_list(),
'-1' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] - pd.Timedelta(days=1))) & (weather_df['zone'] == 'AE')].to_list(),
'-2' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] - pd.Timedelta(days=2))) & (weather_df['zone'] == 'AE')].to_list(),
'-3' : weather_df[col].loc[(weather_df['Date'].isin(m_index['Date'] - pd.Timedelta(days=3))) & (weather_df['zone'] == 'AE')].to_list(),
},index = m_index.index)
m_df = m_df.apply(pd.to_numeric, errors = 'coerce')
m_df['new_values'] = (np.round(m_df.mean(axis = 1)))
# Assign new average to missing cells.
weather_df.iloc[m_df.index, weather_df.columns.get_loc(col)] = m_df['new_values']
# Return original dtypes to columns and reset index.
weather_df['MaxTemperature'] = weather_df['MaxTemperature'].astype('int')
weather_df['MinTemperature'] = weather_df['MinTemperature'].astype('int')
weather_df.reset_index(drop = True, inplace = True)
MaxTemperature | Unique Strings: {'M'}
MinTemperature | Unique Strings: {'M'}
AvgTemperature | Unique Strings: {'M'}
Precipitation | Unique Strings: {'M', 'T'}
To quickly check if any remain, the numeric columns can be cast as numeric type and any remaining string characters will automatically be change to NaN. Then each will be checked for any NaN values.
# Change numeric columns to numeric and coerce all remaining errors/non-numbers to NaN.
weather_df[['MaxTemperature', 'MinTemperature', 'AvgTemperature', 'Precipitation']] = weather_df[
['MaxTemperature', 'MinTemperature', 'AvgTemperature', 'Precipitation']].apply(pd.to_numeric, errors = 'coerce')
weather_df.isna().sum()
Date 0 MaxTemperature 0 MinTemperature 0 AvgTemperature 0 Precipitation 0 zone 0 dtype: int64
Looks like it has been cleaned up.
Now that they are in their proper numeric format, .describe() can be used to get a better picture of their distributions.
display(weather_df.describe())
weather_df.dtypes
| Date | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | |
|---|---|---|---|---|---|
| count | 32400 | 32400.000000 | 32400.000000 | 32400.000000 | 32400.000000 |
| mean | 2013-07-11 13:54:42.666666496 | 66.006204 | 47.391914 | 56.698920 | 0.121390 |
| min | 1993-01-01 00:00:00 | -10.000000 | -23.000000 | -16.500000 | 0.000000 |
| 25% | 2008-04-12 00:00:00 | 51.000000 | 34.000000 | 42.500000 | 0.000000 |
| 50% | 2015-05-28 00:00:00 | 68.000000 | 48.000000 | 57.500000 | 0.000000 |
| 75% | 2019-11-03 00:00:00 | 82.000000 | 63.000000 | 72.500000 | 0.050000 |
| max | 2024-04-10 00:00:00 | 105.000000 | 84.000000 | 93.500000 | 7.610000 |
| std | NaN | 18.895985 | 17.544789 | 17.896425 | 0.340688 |
Date datetime64[ns] MaxTemperature int64 MinTemperature int64 AvgTemperature float64 Precipitation float64 zone object dtype: object
Nothing looks out of the ordinary now.
From the previous plots a few outliers could be seen. No matter the source of these outliers, be it systematic like scheduled maintenance, system downtime after a storm, or measurement errors, it will be important to remove them before training the models.
First, let's take a closer look at the chosen zones.
load_zone_plot(load_zone_df, figsize = (14,14), cmap = zones_palette)
PEP, DOM, and AE each have obvious outliers. A method to detect them needs to be chosen. It seems a quantile threshold approach could be effective in this case.
Here a test an be done to get an idea how far the threshold needs to be set by listing small slices of quantiles.
load_zone_df.loc[load_zone_df['zone'] == 'PEP']['mw'].quantile([0.00005, 0.01, 0.99, 0.9995])
0.00005 1755.867458 0.01000 2029.000000 0.99000 5626.516500 0.99950 6445.806363 Name: mw, dtype: float64
This looks right because from the PEP plot it can be seen most of the obvious outliers are somewhere below 1800 mw.
Here a detection method will be implemented.
mw value within that 35 day window.mw outside the threshold.Along with this method a z-score method will be implemented as well and whichever method seems to capture more of the true outliers, will be used.
Z-Score Method:
outliers_index = {}
outliers_z = {}
replacement_medians = {}
for zone in zones:
outliers_index[zone] = []
outliers_z[zone] = []
zone_df = load_zone_df.loc[load_zone_df['zone'] == zone]
# Find min and max dates for that zone.
date_min = zone_df['datetime_beginning_ept'].min()
date_max = zone_df['datetime_beginning_ept'].max()
# Create tumbling window to locally detect outliers.
window_size = 35 # Days
left = date_min
right = left + pd.Timedelta(days = window_size)
# Tumbling window calculates outlier thresholds
# Two methods: Quantile_Thresh = (median +/- (scalar*SD)) or Zscore = z >/< thresh.
# Wasn't aware of the df.rolling() window method until after, could use instead.
for i in range(int(((date_max-date_min) / pd.Timedelta(days = window_size)) + 1)):
if right > date_max:
right = date_max
window_df = zone_df.loc[(zone_df['datetime_beginning_ept'].between(left, right, inclusive = 'both'))]
w_std = window_df['mw'].std()
w_median = window_df['mw'].median()
q_lower, q_upper = w_median - (5.75*w_std), w_median + (5.75*w_std) # quantile
w_mean = window_df['mw'].mean()
w_zscore = (window_df['mw'] - w_mean) / w_std # zscore
# Quantile and zscore methods.
outliers_temp = list(window_df.loc[(window_df['mw'] < q_lower) | (window_df['mw'] > q_upper)].index)
outliers_z_temp = list(w_zscore.loc[(w_zscore < -5.5) | (w_zscore > 5.5)].index)
# Store outliers if detected.
if len(outliers_temp) > 0:
outliers_index[zone].extend(outliers_temp)
for i in outliers_temp:
replacement_medians[i] = round(w_median, 3)
if len(outliers_z_temp) > 0:
outliers_z[zone].extend(outliers_z_temp)
# Iterate window pointers
left = right + pd.Timedelta(hours = 1)
right = right + pd.Timedelta(days = window_size)
outliers_z
{'AE': [598305, 626861, 671381, 715061, 730185],
'CE': [366819, 583181, 626862, 671382, 758744],
'DOM': [129939, 261988, 288196, 314405, 626863, 671383, 715063, 758745],
'JC': [626864, 671384],
'PEP': [598301, 642824, 671385, 730181, 758741]}
outliers_index
{'AE': [598305, 626861, 671381, 715061, 730185],
'CE': [583181, 626862, 758744],
'DOM': [261988, 626863, 671383, 715063, 758745],
'JC': [],
'PEP': [598301, 642824, 730181]}
The load_zone_plot function was altered to take the index of any outliers and plot them with a black point.
load_zone_plot(load_zone_df, outliers_index, figsize = (14,14), cmap = zones_palette)
The quantile method was chosen as it seemed to calibrate to a better result with this data.
Now to replace the found outliers with the median of the window that was stored earlier.
for zone in zones:
display(load_zone_df.iloc[outliers_index.get(zone)])
load_zone_df.loc[replacement_medians.keys(), 'mw'] = list(replacement_medians.values())
# for zone, index in outliers_index.items():
# load_zone_df.drop(outliers_index.get(zone), inplace = True)
| datetime_beginning_ept | zone | mw | |
|---|---|---|---|
| 598305 | 2020-03-08 01:00:00 | AE | 63.719 |
| 626861 | 2020-11-01 01:00:00 | AE | 1731.123 |
| 671381 | 2021-11-07 01:00:00 | AE | 2180.259 |
| 715061 | 2022-11-06 01:00:00 | AE | 1665.363 |
| 730185 | 2023-03-12 01:00:00 | AE | 68.317 |
| datetime_beginning_ept | zone | mw | |
|---|---|---|---|
| 583181 | 2019-11-03 01:00:00 | CE | 17586.314 |
| 626862 | 2020-11-01 01:00:00 | CE | 15989.914 |
| 758744 | 2023-11-05 01:00:00 | CE | 16491.537 |
| datetime_beginning_ept | zone | mw | |
|---|---|---|---|
| 261988 | 2010-11-07 01:00:00 | DOM | 17494.000 |
| 626863 | 2020-11-01 01:00:00 | DOM | 18249.653 |
| 671383 | 2021-11-07 01:00:00 | DOM | 20633.260 |
| 715063 | 2022-11-06 01:00:00 | DOM | 19902.046 |
| 758745 | 2023-11-05 01:00:00 | DOM | 21330.288 |
| datetime_beginning_ept | zone | mw |
|---|
| datetime_beginning_ept | zone | mw | |
|---|---|---|---|
| 598301 | 2020-03-08 01:00:00 | PEP | 431.527 |
| 642824 | 2021-03-14 01:00:00 | PEP | 354.625 |
| 730181 | 2023-03-12 01:00:00 | PEP | 427.420 |
# To double-check that they were properly replaced.
# load_zone_df.loc[replacement_medians.keys()]
After reviewing the plot after outlier replacement, a few remaining data points were still there. Likely falling within the threshold because the previously removed outliers were skewing the distribution. The previous code could be altered to rerun and capture a second or third-wave but the remaining outliers could also just be captured by visuals and sorting the data.
The remainder will be manually removed for now and replaced with the median using the same process as before.
display(load_zone_df.loc[load_zone_df['zone'] == 'AE'].sort_values('mw', ascending = True).head(3))
display(load_zone_df.loc[load_zone_df['zone'] == 'PEP'].sort_values('mw', ascending = True).head(5))
display(load_zone_df.loc[load_zone_df['zone'] == 'DOM'].sort_values('mw', ascending = True).head(3))
outliers_manual = [642823, 642830, 53319, 93914, 93915, 93916, 367978, 598310, 730188, 93913, 93917, 93912]
manual_medians = []
for outlier in outliers_manual:
zone = load_zone_df.loc[outlier]['zone']
zone_df = load_zone_df.loc[load_zone_df['zone'] == zone]
# Find min and max dates for that zone.
date_min = zone_df['datetime_beginning_ept'].min()
date_max = zone_df['datetime_beginning_ept'].max()
# Create indow to calculate median.
left = zone_df.loc[outlier]['datetime_beginning_ept'] - pd.Timedelta(days = 17)
right = zone_df.loc[outlier]['datetime_beginning_ept'] + pd.Timedelta(days = 17)
window_df = zone_df.loc[(zone_df['datetime_beginning_ept'].between(left, right, inclusive = 'both'))]
w_median = window_df['mw'].median()
manual_medians.append(round(w_median, 3))
load_zone_df.loc[outliers_manual, 'mw'] = manual_medians
| datetime_beginning_ept | zone | mw | |
|---|---|---|---|
| 642823 | 2021-03-14 01:00:00 | AE | 62.426 |
| 605926 | 2020-05-10 14:00:00 | AE | 475.194 |
| 605923 | 2020-05-10 13:00:00 | AE | 475.531 |
| datetime_beginning_ept | zone | mw | |
|---|---|---|---|
| 642830 | 2021-03-14 02:00:00 | PEP | 1406.669 |
| 53319 | 1999-01-31 15:00:00 | PEP | 1505.000 |
| 93914 | 2003-09-19 02:00:00 | PEP | 1578.554 |
| 93915 | 2003-09-19 03:00:00 | PEP | 1591.057 |
| 93916 | 2003-09-19 04:00:00 | PEP | 1609.728 |
| datetime_beginning_ept | zone | mw | |
|---|---|---|---|
| 367978 | 2014-11-18 03:00:00 | DOM | 4724.204 |
| 283165 | 2011-08-28 04:00:00 | DOM | 5516.000 |
| 283162 | 2011-08-28 03:00:00 | DOM | 5519.000 |
# Re-check
# load_zone_df.loc[outliers_manual]
Now to plot the results of the efforts.
load_zone_plot(load_zone_df, figsize = (14,14), cmap = zones_palette)
Looks much better!
It is now time to join the datasets into one and duplicate the weather data to fit into the load dataset's hourly frequency.
load_zone_df.rename(columns = {'datetime_beginning_ept' : 'Date'}, inplace = True)
load_zone_df.reset_index(drop = True, inplace = True)
# Merge
load_df = load_zone_df.merge(weather_df, on = ['Date', 'zone'], how = 'left')
# Sort values by zone and date so we can use .ffill()
load_df.sort_values(by = ['zone', 'Date'], inplace = True)
display(load_df)
# Fill forward weather data up until next day.
load_df.ffill(inplace = True)
# Confirm fill worked.
display(load_df.tail(26))
| Date | zone | mw | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | |
|---|---|---|---|---|---|---|---|
| 371136 | 2015-01-01 00:00:00 | AE | 1180.082 | 43.0 | 17.0 | 30.0 | 0.0 |
| 371143 | 2015-01-01 01:00:00 | AE | 1129.988 | NaN | NaN | NaN | NaN |
| 371149 | 2015-01-01 02:00:00 | AE | 1088.828 | NaN | NaN | NaN | NaN |
| 371151 | 2015-01-01 03:00:00 | AE | 1064.563 | NaN | NaN | NaN | NaN |
| 371157 | 2015-01-01 04:00:00 | AE | 1060.820 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 775991 | 2024-03-27 19:00:00 | PEP | 3202.078 | NaN | NaN | NaN | NaN |
| 776000 | 2024-03-27 20:00:00 | PEP | 3156.612 | NaN | NaN | NaN | NaN |
| 776005 | 2024-03-27 21:00:00 | PEP | 3035.994 | NaN | NaN | NaN | NaN |
| 776009 | 2024-03-27 22:00:00 | PEP | 2861.596 | NaN | NaN | NaN | NaN |
| 776015 | 2024-03-27 23:00:00 | PEP | 2698.059 | NaN | NaN | NaN | NaN |
776016 rows × 7 columns
| Date | zone | mw | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | |
|---|---|---|---|---|---|---|---|
| 775886 | 2024-03-26 22:00:00 | PEP | 2751.947 | 61.0 | 39.0 | 50.0 | 0.01 |
| 775895 | 2024-03-26 23:00:00 | PEP | 2608.338 | 61.0 | 39.0 | 50.0 | 0.01 |
| 775897 | 2024-03-27 00:00:00 | PEP | 2451.982 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775905 | 2024-03-27 01:00:00 | PEP | 2363.249 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775910 | 2024-03-27 02:00:00 | PEP | 2337.508 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775912 | 2024-03-27 03:00:00 | PEP | 2339.010 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775920 | 2024-03-27 04:00:00 | PEP | 2426.600 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775922 | 2024-03-27 05:00:00 | PEP | 2626.287 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775930 | 2024-03-27 06:00:00 | PEP | 2933.378 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775932 | 2024-03-27 07:00:00 | PEP | 3149.023 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775940 | 2024-03-27 08:00:00 | PEP | 3254.780 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775942 | 2024-03-27 09:00:00 | PEP | 3250.445 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775949 | 2024-03-27 10:00:00 | PEP | 3259.116 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775955 | 2024-03-27 11:00:00 | PEP | 3284.213 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775959 | 2024-03-27 12:00:00 | PEP | 3279.457 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775965 | 2024-03-27 13:00:00 | PEP | 3307.597 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775967 | 2024-03-27 14:00:00 | PEP | 3300.685 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775975 | 2024-03-27 15:00:00 | PEP | 3211.159 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775976 | 2024-03-27 16:00:00 | PEP | 3205.710 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775984 | 2024-03-27 17:00:00 | PEP | 3216.504 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775990 | 2024-03-27 18:00:00 | PEP | 3189.807 | 50.0 | 46.0 | 48.0 | 0.78 |
| 775991 | 2024-03-27 19:00:00 | PEP | 3202.078 | 50.0 | 46.0 | 48.0 | 0.78 |
| 776000 | 2024-03-27 20:00:00 | PEP | 3156.612 | 50.0 | 46.0 | 48.0 | 0.78 |
| 776005 | 2024-03-27 21:00:00 | PEP | 3035.994 | 50.0 | 46.0 | 48.0 | 0.78 |
| 776009 | 2024-03-27 22:00:00 | PEP | 2861.596 | 50.0 | 46.0 | 48.0 | 0.78 |
| 776015 | 2024-03-27 23:00:00 | PEP | 2698.059 | 50.0 | 46.0 | 48.0 | 0.78 |
As we can see, after joining the fill-forward method was used to fill in each hour of the same date with the same weather data, until the next day.
Many of the upcoming models will benefit by adding additional features as exogenous covariates.
Specific intervals of features such as lag, rolling mean, derivatives need to be considered carefully when used as exogenous features calculated from the target variable. This can be a great source of data leakage if these are not data you would have in a prediction setting.
For now, the majority of features and especially the shorter length interval features will be excluded to prevent data leakage, but will still remain implemented below (commented out as appropriate) as an exercise in time series feature engineering.
Even though the data is a timeseries with a datetime feature, it is important to extract that data into separate categories as there is certainly multiple seasonalities within this dataset.
For example, each day the load oscillates as people move throughout the day, but also each season likely has its own effect. Separating these features is an important step in capturing the different seasonal periods in time series forecasting.
Hour, Day, Month, Year, and Day of Year (1-365) will all be extracted.
load_df['Hour'] = load_df['Date'].dt.hour
load_df['Day'] = load_df['Date'].dt.day_of_week
day_map = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
load_df['Month'] = load_df['Date'].dt.month
month_map = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
load_df['Year'] = load_df['Date'].dt.year
load_df['Day_of_Year'] = load_df['Date'].dt.day_of_year
load_df.head(3)
| Date | zone | mw | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | Hour | Day | Month | Year | Day_of_Year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 371136 | 2015-01-01 00:00:00 | AE | 1180.082 | 43.0 | 17.0 | 30.0 | 0.0 | 0 | 3 | 1 | 2015 | 1 |
| 371143 | 2015-01-01 01:00:00 | AE | 1129.988 | 43.0 | 17.0 | 30.0 | 0.0 | 1 | 3 | 1 | 2015 | 1 |
| 371149 | 2015-01-01 02:00:00 | AE | 1088.828 | 43.0 | 17.0 | 30.0 | 0.0 | 2 | 3 | 1 | 2015 | 1 |
Rolling means are a very effecting feature in time series forecasting by smoothing the signal, helping models find real trends. The window iterates through the dataset by first initializing a window size of data points which are either centered or behind, and then the average is taken of that window.
Two rolling means will be added.
for zone in zones:
mask = load_df['zone'] == zone
temp_df = load_df.loc[mask]
# # 3 Hour Window
# load_df.loc[mask, 'Rolling_Mean_3'] = load_df[mask]['mw'].rolling(3).mean()
# # 6 Hour Window
# load_df.loc[mask, 'Rolling_Mean_6'] = load_df[mask]['mw'].rolling(6).mean()
# 1 Month Window
load_df.loc[mask, 'Rolling_Mean_1M'] = load_df[mask]['mw'].rolling(24*30).mean()
Lag of the target variable, MW, at different intervals. Lagged features are effective when correlation between the target variable at one time is correlated to a time a certain "lagged" distance in the past. This helps identify trends that have seasonality periods equal to the chosen lag value.
For example, a lag 24 of MW would be the value of MW 24 hours in the past.
Lags of 1, 6, 24, 24 7 (1 Week), 24 30 (1 Month) will be included.
All the way up to a year would be useful, however, it creates rows with empty NaN values the same size of the lag because it has to start that far into the future. These rows will have to be removed as some of the models cannot handle NaNs/no values.
Note: Autocorrelation Functions (ACF) and Partial Autocorrelation Functions (PACF) are great tools to plot and identify which lags have significant correlation in the data. These will be analyzed later on in the SARIMAX model section and might alter what is done in this cell.
# Lag Features
# Create mask for each zone since shift uses index and
# needs to reset shift at each zone.
for zone in zones:
mask = load_df['zone'] == zone
temp_df = load_df.loc[mask]
# # 1 Hour
# load_df.loc[mask, 'Lag_Hour'] = load_df[mask]['mw'].shift(1)
# # 6 Hours
# load_df.loc[mask, 'Lag_6_Hours'] = load_df[mask]['mw'].shift(6)
# # 24 Hours
# load_df.loc[mask, 'Lag_Day'] = load_df[mask]['mw'].shift(24)
# # 1 Week
# load_df.loc[mask, 'Lag_Week'] = load_df[mask]['mw'].shift(24*7)
# 1 Month
load_df.loc[mask, 'Lag_Month'] = load_df[mask]['mw'].shift(24*30)
display(load_df.head(2))
display(load_df.tail(2))
| Date | zone | mw | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | Hour | Day | Month | Year | Day_of_Year | Rolling_Mean_1M | Lag_Month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 371136 | 2015-01-01 00:00:00 | AE | 1180.082 | 43.0 | 17.0 | 30.0 | 0.0 | 0 | 3 | 1 | 2015 | 1 | NaN | NaN |
| 371143 | 2015-01-01 01:00:00 | AE | 1129.988 | 43.0 | 17.0 | 30.0 | 0.0 | 1 | 3 | 1 | 2015 | 1 | NaN | NaN |
| Date | zone | mw | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | Hour | Day | Month | Year | Day_of_Year | Rolling_Mean_1M | Lag_Month | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 776009 | 2024-03-27 22:00:00 | PEP | 2861.596 | 50.0 | 46.0 | 48.0 | 0.78 | 22 | 2 | 3 | 2024 | 87 | 2831.742881 | 2787.267 |
| 776015 | 2024-03-27 23:00:00 | PEP | 2698.059 | 50.0 | 46.0 | 48.0 | 0.78 | 23 | 2 | 3 | 2024 | 87 | 2831.877457 | 2601.164 |
Fourier transforms are incredibly powerful at decomposing signals, in this instance helping identify and separate seasonal periodicity.
By taking a Fast Fourier Transform (FFT) we can then calculate the spectral density and analyze the frequencies that are important in our data. This information alone is useful as a feature and can be used directly as it will give seasonal periods high magnitude values at their peaks, but also will guide some other aspects of this project by identifying all potential seasonality periods.
Note this is a huge subject so it won't be tackled here but fourier transform terms can also be used to extend some models that are only designed to capture one seasonal period into capturing multiples.
for zone in zones:
mask = load_df['zone'] == zone
temp_df = load_df.loc[mask]
# Perform fast fourier transform and take abs() to get spectral density.
load_df.loc[mask, 'FFT'] = np.abs(fft(temp_df['mw'].values)) ** 2
# Assign sample rate and find frequencies for plot.
# d is sample rate. 1 = samples / hour.
load_df.loc[mask, 'FFT_Freq'] = np.fft.fftfreq(temp_df['mw'].size, d = 1)
load_df.tail(2)
| Date | zone | mw | MaxTemperature | MinTemperature | AvgTemperature | Precipitation | Hour | Day | Month | Year | Day_of_Year | Rolling_Mean_1M | Lag_Month | FFT | FFT_Freq | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 776009 | 2024-03-27 22:00:00 | PEP | 2861.596 | 50.0 | 46.0 | 48.0 | 0.78 | 22 | 2 | 3 | 2024 | 87 | 2831.742881 | 2787.267 | 8.072594e+13 | -0.000007 |
| 776015 | 2024-03-27 23:00:00 | PEP | 2698.059 | 50.0 | 46.0 | 48.0 | 0.78 | 23 | 2 | 3 | 2024 | 87 | 2831.877457 | 2601.164 | 1.734200e+15 | -0.000004 |
You can use the frequency axis to find the corresponding periods by taking the reciprocal of the frequency (period = 1 / frequency) which was assigned above in np.fft.fftfreq() with d =.
For example, if you find a peak at frequency f, the corresponding period would be 1/f. This will be used to identify seasonal periods, but for better interpretability f will be multiplied by 24 to change the sample unit to days.
for zone in zones:
plt.subplots(figsize = (12,2))
sns.lineplot(load_df.loc[load_df['zone'] == zone], y = 'FFT', x = 'FFT_Freq', c = zones_palette[zone])
# Assign y limits based on zone to better visualize.
if zone in ['AE', 'JC']:
plt.ylim(bottom = 0, top = 10000000000000)
else:
plt.ylim(bottom = 0, top = 1000000000000000)
# plt.xlim(left = 0) # Negatives are redundant.
plt.xlabel('Frequency (Hz) - Sample / Hour')
plt.ylabel('FFT Power Spectral Density')
plt.title(f'{zone} - Fourier Transform | Cycles')
# List top repeating cycles.
# Find top FFT values.
top_freq = load_df.loc[load_df['zone'] == zone]['FFT'].sort_values(ascending = False).index[1:13:2]
# Transform frequency to period of 1 day for interpretability
top_cycle_days = list(np.round(1 / (np.abs(load_df.loc[top_freq]['FFT_Freq'].astype(float))*24)))
text_box = ['\n' + (str(x)) for x in top_cycle_days]
plt.text(x = 0.5, y = -0.02,
s= f"Peak Frequencies \n (Hz to Days) {' '.join(text_box)}",
transform = ax.transAxes,
bbox = dict(facecolor = zones_palette[zone], edgecolor = 'black', boxstyle = 'round', alpha = 0.5))
plt.show()
Scales were adjusted on this graph heavily to get a better picture and may vary between zones even as they have different magnitudes.
We can see most of the different zones share peak magnitude frequencies at approximately:
These represent the strongest seasonal periods in the time series data.
Some values are duplicated because they are adjacent frequencies that were rounded.
Sometimes called discrete derivatives or differencing, this method is analogous to derivatives but using discrete time series data. You often see this used as a standard metric financial/economic settings.
for zone in zones:
mask = load_df['zone'] == zone
temp_df = load_df.loc[mask]
# 1st Order
load_df.loc[mask, 'Deriv_1'] = load_df[mask]['mw'].diff()
# 2nd Order
load_df.loc[mask, 'Deriv_2'] = load_df[mask]['mw'].diff().diff()
for zone in zones:
fig, ax = plt.subplots(3,1, figsize = (12,4), sharey = False, sharex = True)
sns.lineplot(load_df.loc[load_df['zone'] == zone], ax = ax[0], y = 'mw', x = 'Date', c = zones_palette[zone], alpha = 0.8)
sns.lineplot(load_df.loc[load_df['zone'] == zone], ax = ax[1], y = 'Deriv_1', x = 'Date', c = 'black', alpha = 0.8)
sns.lineplot(load_df.loc[load_df['zone'] == zone], ax = ax[2], y = 'Deriv_2', x = 'Date', c = 'orange', alpha = 0.8)
fig.suptitle(f"{zone} - Derivatives")
fig.subplots_adjust(hspace = 0.1)
# Remove from feature list, to prevent leakage.
# Until another solution is implemented.
load_df = load_df.drop(['Deriv_1', 'Deriv_2'], axis = 1)
def plot_daterange(date_min, date_max, zone_select):
"""Plots MW vs Date in given date range and zone. Interval starts and ends at midnight.
Note: Some of the text positions break down with larger ranges.
Ideal range is two-weeks.
Parameters:
date_min (string): Date of beginning of interval (inclusive).
date_max (string): Date of end of interval (inclusive).
zone_select (string): Zone from load_zone_df to plot.
Returns:
None. Prints plot to output."""
date_min = pd.to_datetime(f'{date_min} 00:00:00')
date_max = pd.to_datetime(f'{date_max} 00:00:00')
date_range = (date_max - date_min).days
weather_temp = weather_df.loc[(weather_df['Date'].between(date_min, date_max, inclusive='both')) &
(weather_df['zone'] == zone_select)].reset_index(drop = True)
plt.subplots(figsize = (14, 8))
sns.lineplot(load_zone_df.loc[(load_zone_df['Date'].between(date_min, date_max, inclusive='both')) &
(load_zone_df['zone'] == zone_select)],
x = 'Date',
y = 'mw',
c = zones_palette[zone_select],
alpha = 0.7, lw = 1, marker = 'o')
# Iterates through each day in range.
for i in range(date_range + 1):
# Day separating line.
plt.axvline(date_min + pd.Timedelta(days = i), color = 'purple', ls = '--')
if i > date_range-1:
continue
# Day of the week.
plt.text(x = date_min + pd.Timedelta(days = i) + pd.Timedelta(hours = 10),
y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
s = f'{(date_min + pd.Timedelta(days = i)).day_name()}',
rotation = 'vertical',
weight='bold',
alpha = 0.75)
# Temperatures.
# High
plt.text(x = date_min + pd.Timedelta(days = i) + pd.Timedelta(hours = 6),
y = ((plt.ylim()[0] + plt.ylim()[1]) / 2) + ((plt.ylim()[0] + plt.ylim()[1]) / 8),
s = f"Hi:{weather_temp['MaxTemperature'].iloc[i]}\N{DEGREE SIGN}",
rotation = 'horizontal',
alpha = 0.75)
# Low
plt.text(x = date_min + pd.Timedelta(days = i) + pd.Timedelta(hours = 6),
y = ((plt.ylim()[0] + plt.ylim()[1]) / 2) - ((plt.ylim()[0] + plt.ylim()[1]) / 8),
s = f"Lo:{weather_temp['MinTemperature'].iloc[i]}\N{DEGREE SIGN}",
rotation = 'horizontal',
alpha = 0.75)
return None
plot_daterange('8/01/2023', '8/16/2023', 'DOM')
This is about as expected, load is lowest at night and rises throughout the day. Data is from August on the East Coast of the US which tends to be their hottest month.
Things to note:
plot_daterange('01/07/2024', '01/22/2024', 'CE')
In contrast to the last plot, this is data from Chicago in one of the coldest months, January.
load_day_df = load_zone_df.copy(deep = True)
load_day_df['Date'] = pd.to_datetime(load_day_df['Date'].dt.date)
load_day_df = load_day_df.groupby(['Date', 'zone'], as_index = False).sum()
#load_day_df['mw_norm'] = float
for zone in zones:
mask = load_day_df['zone'] == zone
temp_df = load_day_df.loc[mask]
min_mw = temp_df['mw'].min()
max_mw = temp_df['mw'].max()
load_day_df.loc[mask, 'mw_norm'] = (temp_df['mw'] - min_mw) / max_mw
palette = ['#570408', '#005d5d', '#1192e8', '#fa4d56', '#012749']
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
plt.subplots(figsize = (12, 6))
sns.boxplot(load_day_df,
x = (load_day_df['Date'].dt.day_name()),
y = load_day_df['mw_norm'],
hue = 'zone',
order = order,
palette = sns.color_palette(palette, 5),
saturation = 0.65)
plt.grid(alpha = 0.5)
Saturday and Sunday do seem to be lower but not by a significant amount.
# Month
plt.subplots(figsize = (14, 6))
sns.boxplot(load_day_df,
x = (load_day_df['Date'].dt.month_name()),
y = load_day_df['mw_norm'],
hue = 'zone',
palette = sns.color_palette(palette, 5),
saturation = 0.65)
plt.grid(alpha = 0.5)
Summer months not only have the most variance but also highest load demands.
Here, a histogram plot of MW of all zones will be made.
fig, ax = plt.subplots(1, 5, figsize=(12, 4), sharey=True)
for i, zone in enumerate(zones):
temp_df = load_df.query("zone == @zone")['mw']
sns.histplot(temp_df, ax = ax[i], color = zones_palette[zone])
ax[i].axvline(temp_df.median(), c = 'black', alpha = 0.6)
ax[i].axvline(temp_df.mean(), c = 'red', alpha = 0.6)
ax[i].set_title(zone)
fig.suptitle('Histogram of MW')
fig.tight_layout()
fig, ax = plt.subplots(1, 5, figsize=(12, 4), sharey=True)
for i, zone in enumerate(zones):
temp_df = load_df.query("zone == @zone")['mw']
sns.histplot(temp_df, ax = ax[i], color = zones_palette[zone])
#ax[i].axvline(temp_df.mean(), c = 'red', alpha = 0.6)
quant = load_df.query("zone == @zone")['mw'].quantile(np.linspace(0,1,5))
for quantile in quant:
ax[i].axvline(quantile, c = 'purple', alpha = 0.6, ls = '--')
ax[i].set_title(zone)
ax[i].axvline(temp_df.median(), c = 'black', alpha = 0.7)
ax[i].axvspan(ax[i].get_xlim()[0], quant.iloc[0], facecolor='#b35d24', alpha = 0.4)
ax[i].axvspan(quant.iloc[-1], ax[i].get_xlim()[1], facecolor='#b35d24', alpha = 0.4)
ax[i].margins(x = 0)
fig.suptitle('Histogram of MW')
fig.tight_layout()
See the beginning of section 7.2. for a few time series specific analyses.
To avoid data leakage the test set will always be assigned the most recent dates with zero overlap in train or validation sets.
Since each zone has different beginning dates but share the same end dates this can be used to split each zone in the same way.
Important Note: In a decision-making setting, this split should have occurred before the EDA section to avoid data-leakage/snooping. This was not done in this case as this is just an exercise.
# Remove any NaN rows as some models can't handle them.
load_df = load_df[~load_df.isnull().any(axis = 1)]
load_df.reset_index(drop = True, inplace = True)
# Train - Min up to 2021
# Val - 2021 up to 2023
# Test - 2023 to Max
train_cutoff = '1/1/2019'
test_cutoff = '1/1/2022'
train = {}
val = {}
test = {}
for zone in zones:
train[zone] = []
val[zone] = []
test[zone] = []
train[zone] = load_df.loc[(load_df['zone'] == zone) &
(load_df['Date'] < pd.to_datetime(train_cutoff))]
val[zone] = load_df.loc[(load_df['zone'] == zone) &
(load_df['Date'] >= pd.to_datetime(train_cutoff)) &
(load_df['Date'] < pd.to_datetime(test_cutoff))]
test[zone] = load_df.loc[(load_df['zone'] == zone) &
(load_df['Date'] >= pd.to_datetime(test_cutoff))]
# Drop rows with NaN created from feature engineering.
train[zone] = train[zone][~train[zone].isnull().any(axis = 1)]
Because of the way a few models work, the leap year in the validation set was causing issues. It will be removed here for now until another solution is worked on. Either way it shouldn't have much importance to the results.
# Leap year in validation causing issues in seasonal naive model.
# Removing for now.
for zone in zones:
leap_year_idx = val[zone].loc[val[zone]['Date'].between('2020-02-29 00:00:00',
'2020-02-29 23:00:00',
inclusive = 'both')].index
val[zone] = val[zone].drop(leap_year_idx)
Sets will be placed into dictionaries keyed by zone name, so models can iterate through each zone and fit/predict separately.
X_train, X_val, X_test = {}, {}, {}
y_train, y_val, y_test = {}, {}, {}
for zone in zones:
X_train[zone], X_val[zone], X_test[zone] = [], [], []
y_train[zone], y_val[zone], y_test[zone] = [], [], []
# Train
X_train[zone] = train[zone].set_index('Date')
y_train[zone] = X_train[zone]['mw']
X_train[zone] = X_train[zone].iloc[:, 2:]
# Validation
X_val[zone] = val[zone].set_index('Date')
y_val[zone] = X_val[zone]['mw']
X_val[zone] = X_val[zone].iloc[:, 2:]
# Test
X_test[zone] = test[zone].set_index('Date')
y_test[zone] = X_test[zone]['mw']
X_test[zone] = X_test[zone].iloc[:, 2:]
Now, plot the split regions for reference.
Note:
def set_region_overlay(set_label, start_date, end_date, x_offset):
mid_date_train = (list(pd.date_range(start = pd.to_datetime(start_date), end = pd.to_datetime(end_date))))
plt.text(x = mid_date_train[int(len(mid_date_train) / 2) + x_offset],
y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
s = set_label,
rotation = 'horizontal',
weight = 'extra bold',
fontsize = 'large',
antialiased = True,
alpha = 1,
c = 'white',
bbox = dict(facecolor = 'black', edgecolor = 'black', boxstyle = 'round', alpha = 0.6))
return None
plt.subplots(figsize = (14,4))
sns.lineplot(load_day_df, x = 'Date', y = 'mw_norm', hue = 'zone', palette=sns.color_palette(palette, 5), lw = 0.7, alpha = 0.7)
plt.margins(x = 0)
plt.axvline(pd.to_datetime(train_cutoff), color = 'purple', ls = '--')
plt.axvline(pd.to_datetime(test_cutoff), color = 'purple', ls = '--')
plt.axvspan(load_day_df['Date'].min(), pd.to_datetime(train_cutoff), facecolor='#2b0054', alpha = 0.4)
plt.axvspan(pd.to_datetime(train_cutoff), pd.to_datetime(test_cutoff), facecolor='#004a54', alpha = 0.4)
plt.axvspan(pd.to_datetime(test_cutoff), load_day_df['Date'].max(), facecolor='#5e3904', alpha = 0.4)
set_region_overlay('T R A I N', load_day_df['Date'].min(), train_cutoff, 0)
set_region_overlay('V A L', train_cutoff, test_cutoff, -225)
set_region_overlay('T E S T', train_cutoff, load_day_df['Date'].max(), 225)
plt.legend().remove()
plt.show()
An important question arises when using a validation set to tune your models in a time series domain because of the temporal nature of the data. Oftentimes, but not always, the most recent observations have the biggest impact on future forecasted predictions. This can be an issue in the process of hyperparameter optimization because the fit model now contains a time-gap between the training set and the testing set.
A choice remains:
In this case, choice 2 will be the chosen path, but both have their merits and are highly dependant on the dataset and target value.
# Creating this as a global for now.
horizon_selector = {'6H' : pd.Timedelta(hours = 6-1),
'3D' : pd.Timedelta(hours = (24*3)-1),
'1W' : pd.Timedelta(hours = (24*7)-1),
'1M' : pd.Timedelta(hours = (24*30)-1),
'6M' : pd.Timedelta(hours = (24*181)-1),
'2Y' : pd.Timedelta(hours = (24*365*2)-1)}
def generate_test_horizon(zone, forecast_horizon, X_y):
"""Generates slice of test set according to input zone, forecast window, and either X or y - always starting at the minimum date [ :forecast_horizon].
Parameters:
zone (string): String of zone data to retrieve.
forecast_horizon (string): 6H, 3D, 1W, 1M, 6M, 2Y which will slice test up to that amount of time.
X_y (string): X or y to choose which to generate.
Returns:
Numpy Array of test[zone] interval up to the forecast window."""
if forecast_horizon not in horizon_selector:
raise Exception('That forecast window is not supported, please choose one of the following: 6H, 3D, 1W, 1M, 6M, 2Y.')
date_min = test[zone]['Date'].min()
if X_y == 'X':
_test = test[zone].loc[test[zone]['Date'].between(date_min,
date_min + horizon_selector[forecast_horizon],
inclusive = 'both')].drop(['zone', 'Date', 'mw'], axis = 1)
if X_y == 'y':
_test = test[zone].loc[test[zone]['Date'].between(date_min,
date_min + horizon_selector[forecast_horizon],
inclusive = 'both')]['mw'].to_numpy()
return _test
A baseline model is needed to compare all other models to. Here we will initialize two different approaches.
Note: The time series data used here has little to no trend/drift (increasing or decreasing over time), but it would be useful to add a baseline that could handle it if we did. A baseline that captures and forecasts the slope of that trend could be used analogous to the mean model here.
The Seasonal Naive Forecasting Model is a simple but often effective baseline model. It will be setup here for different forecast horizons from which all other models will be measured against.
This model forecasts future values to be equal to past values of the target variable (MW) shifted backwards one period of the input season. As we predict different forecast horizons, the length of the horizon can be used as a seasonal-lag to adjust the naive forecast lengths.
For example:
If we want to forecast a week into the future, the previously observed values from the same week last year will be used as the forecast.
Note:
Since the validation set will not be necessary in these baseline models, they are being utilized here instead of train since the validation set contains the last observed values of the last seasonal period.
# TODO: Generalize this section better. Handle all horizons and generalize y_pred method.
# Find date for each forecast window (corresponding date in validate forecast_window)
naive_date = {'1H' : (val[zone]['Date'].max() + pd.Timedelta(hours = 1)) - pd.Timedelta(days = 365),
'6H' : (val[zone]['Date'].max() + pd.Timedelta(hours = 6)) - pd.Timedelta(days = 365),
'3D' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*3)) - pd.Timedelta(days = 365),
'1W' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*7)) - pd.Timedelta(days = 365),
'1M' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*30)) - pd.Timedelta(days = 365),
'6M' : (val[zone]['Date'].max() + pd.Timedelta(hours = 24*181)) - pd.Timedelta(days = 365),
'1Y' : (val[zone]['Date'].max()),
'2Y' : (val[zone]['Date'].max())}
# Initialize prediction dictionary - {zone : {horizon : y_pred}}
naive = {}
for zone in zones:
naive[zone] = {}
for horizon in horizon_selector:
naive[zone][horizon] = []
# 1H and 2Y have different methods currently.
if horizon == '1H':
naive[zone][horizon] = val[zone].loc[val[zone]['Date'] == naive_date[horizon]]['mw'].to_numpy()
continue
if horizon == '2Y':
naive[zone][horizon] = val[zone].loc[val[zone]['Date'].between(naive_date['1H'] - pd.Timedelta(days = 366),
naive_date[horizon], inclusive = 'both')]['mw'].to_numpy()
continue
# Create naive predictions (y_pred) of 1 Hour, 3 Day, 1 Week, 1 Month, 6 Months, 1 Year, and 2 Years forecast windows.
naive[zone][horizon] = val[zone].loc[val[zone]['Date'].between(naive_date['1H'],
naive_date[horizon], inclusive = 'both')]['mw'].to_numpy()
# Create 1Y for example plot later. Only used for that.
naive[zone]['1Y'] = val[zone].loc[val[zone]['Date'].between(naive_date['1H'],
naive_date[horizon], inclusive = 'both')]['mw'].to_numpy()
display(naive['CE']['6H'])
array([9988.853, 9715.127, 9443.542, 9263.926, 9183.058, 9194.837])
print(len(naive['AE']['6M'])/(365*24))
print(len(naive['AE']['2Y'])/(365*24))
0.4958904109589041 2.0
plt.subplots(figsize=(10,4))
sns.lineplot(val['CE'], x = 'Date', y = 'mw', color = zones_palette['CE'], alpha = 0.8)
sns.lineplot(x = pd.date_range(test['CE']['Date'].min(), periods = 24*365, freq='H'), y = naive['CE']['1Y'], color = 'purple', alpha = 1)
sns.lineplot(x = pd.date_range(test['CE']['Date'].min(), periods = 24*365, freq='H'), y = naive['CE']['1Y'], color = 'purple', alpha = 1)
sns.lineplot(val['CE'].iloc[-365*24:], x = 'Date', y = 'mw', color = 'purple', alpha = 0.4)
plt.axvline(val['CE']['Date'].max(), color = 'black', ls = '--')
plt.axvline(val['CE']['Date'].iloc[-365*24], color = 'black', ls = '--', alpha = 0.5)
plt.xlim(val['CE']['Date'].max() - pd.Timedelta(days=365*3), val['CE']['Date'].max() + pd.Timedelta(days=365))
plt.axvspan(val['CE']['Date'].max(), val['CE']['Date'].max() + pd.Timedelta(days=365), color = 'purple', alpha = 0.2)
plt.title('Seasonal Naive Forecast Example')
plt.show()
A very simple baseline model can be had by taking the mean of the historical time series data and using that mean to predict all future values.
mean = {}
for zone in zones:
mean[zone] = {}
for horizon in horizon_selector:
mean[zone][horizon] = []
y_test_horizon = generate_test_horizon(zone, horizon, 'y')
y_mean = load_df.loc[(load_df['zone'] == zone) & (load_df['Date'] < test_cutoff)]['mw'].mean()
y_pred = np.full_like(y_test_horizon, fill_value = y_mean)
mean[zone][horizon] = y_pred
mean['PEP']['6H']
array([3427.67975695, 3427.67975695, 3427.67975695, 3427.67975695,
3427.67975695, 3427.67975695])
plt.subplots(figsize=(10,4))
sns.lineplot(train['PEP'], x = 'Date', y = 'mw', color = zones_palette['PEP'], alpha = 0.8)
plt.axhline(train['PEP']['mw'].mean(), color = 'purple')
plt.axvline(train['PEP']['Date'].max(), color = 'black', ls = '--')
plt.xlim(train['PEP']['Date'].max() - pd.Timedelta(days=365*3), train['PEP']['Date'].max() + pd.Timedelta(days=365))
plt.axvspan(train['PEP']['Date'].max(), train['PEP']['Date'].max() + pd.Timedelta(days=365), color = 'purple', alpha = 0.2)
plt.title('Mean Model Forecast Example')
Text(0.5, 1.0, 'Mean Model Forecast Example')
Before fitting any type of AutoRegressive Integrated Moving Average (ARIMA) model, the time series data needs to be checked if it is stationary, or has a time-dependent structure. Then, if the data is indeed stationary, the next step is to analyze the Autocorrelation Function (ACF) and Partial Autocorrelation Functions (PACF) to help identify seasonal parameters for the model setup.
The plots in previous sections of the data were already a decent indication of the data's stationarity (also the fourier transform should confirm the upcoming ACF and PACF plots), but a statistical test can be performed to help confirm stationarity.
Augmented Dickey-Fuller Test:
for zone in zones:
stationarity = (adfuller(y_train[zone]))
# Check that the test statistic is less than the 5% confidence interval and
# the p-value is less than 0.05.
if (stationarity[1] < 0.05) & (stationarity[4]['5%'] > stationarity[0]):
print(f"{zone} | Stationary! With a p-value of {stationarity[1]}, we reject the null hypothesis of Augmented Dickey-Fuller test.")
else:
print(f"{zone} | Not Stationary! With a p-value of {stationarity[1]}, we fail to reject the null hypothesis of Augmented Dickey-Fuller test.")
AE | Stationary! With a p-value of 1.1607469326048218e-12, we reject the null hypothesis of Augmented Dickey-Fuller test. CE | Stationary! With a p-value of 0.0, we reject the null hypothesis of Augmented Dickey-Fuller test. DOM | Stationary! With a p-value of 5.989020628296604e-29, we reject the null hypothesis of Augmented Dickey-Fuller test. JC | Stationary! With a p-value of 1.3026665225491982e-18, we reject the null hypothesis of Augmented Dickey-Fuller test. PEP | Stationary! With a p-value of 0.0, we reject the null hypothesis of Augmented Dickey-Fuller test.
Now that it has been proven the time series data is stationary, meaning no transformations need to be performed and the ACF and PACF plots can be used.
# Plot 2 Years
plot_acf(y_train['AE'], lags = 24*365*2, auto_ylims = True)
# Plot 1 Day
plot_acf(y_train['AE'], lags = 30, auto_ylims = True)
plt.show()
# Identifying the peaks and valleys outside the confidence intervals.
# Divide by 24 hours to find the days.
print(2450/24, 'Days')
print(8700/24, 'Days')
102.08333333333333 Days 362.5 Days
plot_pacf(y_train['AE'], lags = 30)
plt.show()
Analyzing the ACF and PACF Plots:
Potential Issues: Our view of the data in these plots might not be at the correct scale to properly adjust the SARIMA model. This can be adjusted of course however, SARIMA also struggles with large datasets especially with added exogenous covariates. Also, the fact that several seasonal periods can be identified in our data means SARIMA won't find a proper fit as it will always only fit one of the cycles - 24 hour cycle, year cycle, etc.. This also can be fixed by decomposing the signal and using fourier terms but is outside the scope of this project.
Turned off for now (see below).
# order = (3, 0, 10) # (p-integration, d-AR, q-MA)
# seasonal_order = (0, 0, 0, 24) # (p, d, q, s-seasonal period)
# sarima_model = sm.tsa.SARIMAX(endog = y_train['AE'].asfreq('H'),
# # Only FFT for now.
# #exog = X_train['AE']['FFT_Freq'].asfreq('H'),
# order = order,
# seasonal_order = seasonal_order,
# enforce_stationarity = False,
# enforce_invertibility = False,
# simple_differencing = True
# ).fit()
# sarima_model.summary()
# # Visualize validation set prediction to tune model.
# plt.subplots(figsize = (12,6))
# sns.lineplot(sarima_model.forecast(35040))#, exog = X_val['AE']['FFT_Freq']))
# sns.lineplot(y_val['AE'], c = zones_palette['AE'], alpha = 0.5)
# # plt.xlim(right = pd.to_datetime('2021-07-01'))
# # plt.ylim(bottom = 0, top = 3000)
The amount of compute it takes to fit this model properly for long forecast horizons exceeds the abilities of my machine, even with fitting and validating on one of the zones with the shortest time period. Unfortunately this means the SARIMAX model will not be completed here, however this exercise does illustrate how effective time series forecasting requires an art of balancing trade-offs.
Potential Solutions:
If the goal was only short forecast horizons, you could get away with much less training data - say 1 week for hourly forecasts or 1 month for week forecasts, etc. However, this removes the effectiveness of some of the features
A solution for long forecast horizons could be to smooth and resample the original data and use SARIMAX on that, only focusing on one of the longer seasonal periods like a year. Though, this method would yield poor performance which wouldn't perform too much better than a rolling average forecasting model when evaluating it back to the real observed target variable.
Holt-Winters’ method (Sometimes referred to as triple exponential smoothing) is an extension of exponential smoothing for data that contains both trend and seasonality.
Here you can choose either a model additive or multiplicative error terms and you also need to set a season length. During validation, a season length of 1 week (247) performed well on short forecast horizons but poor at longer horizons. A year (24365) would perform well, but due to lengthy training times, 1 week was chosen.
holtwinters_y_pred = {}
for zone in zones:
print('Currently Fitting:', zone)
holtwinters_y_pred[zone] = {}
# Create train+val for final fit.
temp_df = pd.concat([train[zone], val[zone]], axis = 0)
season_length = 24*7 # 1 Week
# Fit model per zone.
holtwinters_model = HoltWinters(season_length = season_length, error_type = 'A')
holtwinters_model = holtwinters_model.fit(y = temp_df['mw'].to_numpy(), X = temp_df.iloc[:, 3:].to_numpy())
for horizon in horizon_selector:
holtwinters_y_pred[zone][horizon] = []
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
# Predict per horizon.
holtwinters_y_pred[zone][horizon] = holtwinters_model.predict(h = horizon_freq, X = test[zone].iloc[:horizon_freq])['mean']
# y_test_horizon = y_val[zone].iloc[:horizon_freq]
# print(horizon, np.sqrt(sum((y_test_horizon - holtwinters_y_pred['mean'])**2)/len(y_test_horizon)))
# plt.subplots(figsize=(10,2))
# sns.lineplot(x = y_test_horizon.index, y = y_test_horizon)
# sns.lineplot(x = y_test_horizon.index, y = holtwinters_y_pred['mean'], color = 'purple')
# plt.show()
Currently Fitting: AE Currently Fitting: CE Currently Fitting: DOM Currently Fitting: JC Currently Fitting: PEP
Note: Currently implemented without exogenous variables. This model has the ability to include them but required too much compute for current hardware during setup validations.
# MSTL across all zones and horizons.
mstl_models = {}
for zone in zones:
mstl_models[zone] = {}
print('Currently Fitting:', zone)
for horizon in horizon_selector:
if horizon in ['6M', '2Y']:
continue # Skip, too long for this model.
print(f"{'':<5}Forecast Horizon: {horizon}")
mstl_models[zone][horizon] = []
# Convert horizon which is in pd.timedelta to hours integer.
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
model = MSTL(season_length = [24, 24*7, 24*182, 24*365],
trend_forecaster = AutoARIMA(prediction_intervals = ConformalIntervals(h = horizon_freq, n_windows = 4)))
mstl_models[zone][horizon] = model.fit(y = y_val[zone])
Currently Fitting: AE
Forecast Horizon: 6H
Forecast Horizon: 3D
Forecast Horizon: 1W
Forecast Horizon: 1M
Currently Fitting: CE
Forecast Horizon: 6H
Forecast Horizon: 3D
Forecast Horizon: 1W
Forecast Horizon: 1M
Currently Fitting: DOM
Forecast Horizon: 6H
Forecast Horizon: 3D
Forecast Horizon: 1W
Forecast Horizon: 1M
Currently Fitting: JC
Forecast Horizon: 6H
Forecast Horizon: 3D
Forecast Horizon: 1W
Forecast Horizon: 1M
Currently Fitting: PEP
Forecast Horizon: 6H
Forecast Horizon: 3D
Forecast Horizon: 1W
Forecast Horizon: 1M
# .predict() each model.
mstl_y_pred = {}
for zone in zones:
mstl_y_pred[zone] = {}
for horizon in horizon_selector:
mstl_y_pred[zone][horizon] = []
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
# Too long for this model.
if horizon in ['6M', '2Y']:
continue
mstl_y_pred[zone][horizon] = mstl_models[zone][horizon].predict(h = horizon_freq, level = [80, 95])
# Plot all zones and horizon predictions.
for zone in zones:
fig, ax = plt.subplots(4, 1, figsize = (8, 5), sharex = True)
for i, horizon in enumerate(horizon_selector):
if horizon in ['6M', '2Y']:
continue # Skip, too long for this model.
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
plot_df = pd.concat([y_test[zone].iloc[:horizon_freq],
pd.Series(mstl_y_pred[zone][horizon]['mean'],
index = y_test[zone].iloc[:horizon_freq].index,
name = 'y_hat')],axis = 1)
ax[i].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.9, label = 'True' if i == 0 else '')
ax[i].plot(plot_df.index, plot_df['y_hat'], c = 'purple', alpha = 0.9, label = 'Forecast' if i == 0 else '')
ax[i].fill_between(x = plot_df.index[-horizon_freq: ],
y1 = mstl_y_pred[zone][horizon]['lo-80'][-horizon_freq: ],
y2 = mstl_y_pred[zone][horizon]['hi-80'][-horizon_freq: ],
alpha = 0.2, color = '#3fb4fc', label = 'CI - 80' if i == 0 else '')
ax[i].fill_between(x = plot_df.index[-horizon_freq: ],
y1 = mstl_y_pred[zone][horizon]['lo-95'][-horizon_freq: ],
y2 = mstl_y_pred[zone][horizon]['hi-95'][-horizon_freq: ],
alpha = 0.2, color = '#82ccfa', label = 'CI - 95' if i == 0 else '')
ax[i].grid()
ax[i].set_title(horizon)
fig.suptitle(f"{zone} - MSTL | All Forecast Horizons")
fig.legend(loc = 'upper right', ncol = 2)
fig.tight_layout()
plt.show()
# Reformat into y_pred dict to match other models.
mstl_y_pred_rfmt = {}
for zone in zones:
mstl_y_pred_rfmt[zone] = {}
for horizon in horizon_selector:
mstl_y_pred_rfmt[zone][horizon] = []
# Set missing horizon results to N/A so it shows up in results.
if horizon in ['6M', '2Y']:
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
mstl_y_pred_rfmt[zone][horizon] = np.full((horizon_freq),'N/A')
continue
mstl_y_pred_rfmt[zone][horizon] = mstl_y_pred[zone][horizon]['mean']
# Centered and Standardized for use in SVMs.
scaler = StandardScaler()
X_train_scaled = {}
X_val_scaled = {}
X_test_scaled = {}
for zone in zones:
X_train_scaled[zone] = []
X_val_scaled[zone] = []
X_test_scaled[zone] = []
X_train_scaled[zone] = pd.DataFrame(scaler.fit_transform(X_train[zone]), columns = scaler.get_feature_names_out())
X_val_scaled[zone] = pd.DataFrame(scaler.fit_transform(X_val[zone]), columns = scaler.get_feature_names_out())
X_test_scaled[zone] = pd.DataFrame(scaler.transform(X_test[zone]), columns = scaler.get_feature_names_out())
# # Grid search for best parameters.
# # Utilized and commented out to save rerun time.
# params = {'C' : [750, 900, 1024], 'gamma' : [0.01, 0.1], 'kernel' : ['linear', 'rbf']}
# #ps = PredefinedSplit(test_fold = val_fold)
# grid_svm = GridSearchCV(SVR(), param_grid = params, cv = 2, verbose = 3).fit(X_train_scaled['JC'], y_train['JC'])
# mod_svm = grid_svm.best_estimator_
# print(mod_svm)
# print(mod_svm.score(X_train_scaled['JC'], y_train['JC']))
# SVM on all zones
svm_val_results = {}
svm_models = {}
for zone in zones:
print('Currently Fitting:', zone)
svm_val_results[zone] = []
svm_models[zone] = []
# Concat train and val sets for final training.
X_temp_train_df = pd.concat([X_train_scaled[zone], X_val_scaled[zone]]).reset_index(drop = True)
y_temp_train_df = pd.concat([y_train[zone], y_val[zone]]).reset_index(drop = True)
# Fit model.
#svm_model = SVR(C = 32768, gamma = 7.62939453125e-06, kernel = 'linear')
svm_model = SVR(C = 900, gamma = 0.01, kernel = 'rbf')
svm_model.fit(X_temp_train_df, y_temp_train_df)
svm_models[zone] = svm_model
Currently Fitting: AE Currently Fitting: CE Currently Fitting: DOM Currently Fitting: JC Currently Fitting: PEP
svm_y_pred = {}
for zone in zones:
svm_y_pred[zone] = {}
for horizon in horizon_selector:
svm_y_pred[zone][horizon] = []
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
X_test_horizon = X_test_scaled[zone].iloc[:horizon_freq]
svm_y_pred[zone][horizon] = svm_models[zone].predict(X_test_horizon)
# # Grid search for best parameters.
# # Utilized and commented out to save rerun time.
# params = {'max_depth' : [3,5,7,10], 'gamma' : [0.05,0.1,0.5,5], 'learning_rate': [0.1]}
# grid_xgb = GridSearchCV(xgb.XGBRegressor(booster = 'gbtree',
# n_estimators = 5000,
# early_stopping_rounds = 100,
# objective = 'reg:squarederror',
# eval_metric = 'rmse'),
# param_grid = params,
# cv = 2,
# verbose = 1)
# grid_xgb.fit(X_train['AE'], y_train['AE'],
# eval_set = [(X_train['AE'], y_train['AE']), (X_val['AE'], y_val['AE'])],
# verbose = 1000)
# grid_xgb.best_estimator_
# XGBoost on all zones
xgb_val_results = {}
xgb_models = {}
for zone in zones:
print(zone)
xgb_val_results[zone] = []
xgb_models[zone] = []
# Concat train and val sets for final training.
X_temp_train_df = pd.concat([X_train[zone], X_val[zone]]).reset_index(drop = True)
y_temp_train_df = pd.concat([y_train[zone], y_val[zone]]).reset_index(drop = True)
xgb_model = xgb.XGBRegressor(booster = 'gbtree',
n_estimators = 6000,
max_depth = 7,
learning_rate = 0.1,
gamma = 0.1,
early_stopping_rounds = 100,
objective = 'reg:squarederror',
eval_metric = 'rmse')
xgb_model.fit(X_temp_train_df, y_temp_train_df,
eval_set = [(X_temp_train_df, y_temp_train_df)],
verbose = 2000)
xgb_val_results[zone] = xgb_model.best_score
xgb_models[zone] = xgb_model
AE [0] validation_0-rmse:307.48081 [2000] validation_0-rmse:10.97208 [4000] validation_0-rmse:4.79145 [5999] validation_0-rmse:2.44426 CE [0] validation_0-rmse:2137.47512 [2000] validation_0-rmse:107.36722 [4000] validation_0-rmse:60.90634 [5999] validation_0-rmse:40.16256 DOM [0] validation_0-rmse:2223.40895 [2000] validation_0-rmse:134.66029 [4000] validation_0-rmse:76.31948 [5999] validation_0-rmse:49.97660 JC [0] validation_0-rmse:657.91238 [2000] validation_0-rmse:24.18403 [4000] validation_0-rmse:10.17122 [5999] validation_0-rmse:5.05276 PEP [0] validation_0-rmse:715.47620 [2000] validation_0-rmse:53.81193 [4000] validation_0-rmse:35.40717 [5999] validation_0-rmse:26.46956
xgb_y_pred = {}
for zone in zones:
xgb_y_pred[zone] = {}
for horizon in horizon_selector:
xgb_y_pred[zone][horizon] = []
X_test_horizon = generate_test_horizon(zone, horizon, 'X')
xgb_y_pred[zone][horizon] = xgb_models[zone].predict(X_test_horizon)
xgb.plot_importance(xgb_model)
plt.show()
xgb.plot_tree(xgb_model, num_trees=12)
<Axes: >
plt.subplots(figsize = (10,4))
sns.lineplot(generate_test_horizon('AE', '2Y', 'y'), c = zones_palette['AE'], alpha = 0.8)
sns.lineplot(xgb_y_pred['AE']['2Y'], c = 'purple', alpha = 0.8)
<Axes: >
First we need to prepare the data for use in these models.
Since we won't be able to handle training of the full dataset we will just use a small slice before the Test Set cutoff for training and slice the Test Set for different forecast horizons.
Note: All tuning was done on Train and Val sets.
# MacOS - Set cuda device to mps.
print(torch.backends.mps.is_available())
torch.device('mps')
# Create dataframe in Nixtla/Pytorch format
load_df.reset_index(drop = False, inplace = True)
load_df.rename(columns = {'index' : 'time_idx'}, inplace = True)
# Add 7 days of test data that will be used during .predict()
pytorch_df = load_df.loc[(load_df['Date'] >= pd.to_datetime(train_cutoff)) &
(load_df['Date'] < (pd.to_datetime(test_cutoff) + pd.Timedelta(days = 30)))] # Set horizon here.
# Less to assign later if we just rename these columns.
pytorch_df = pytorch_df.rename(columns = {'Date' : 'ds', 'mw' : 'y', 'zone' : 'unique_id'})
# Build static df to groupby zone in Nixtla.
pytorch_df_static = pd.DataFrame({'unique_id' : zones,
'zones' : zones})
pytorch_df_static = pytorch_df_static.join(pd.get_dummies(pytorch_df_static['zones'], dtype = int))
pytorch_df_static = pytorch_df_static.drop(['zones'], axis = 1)
pytorch_df_static
True
| unique_id | AE | CE | DOM | JC | PEP | |
|---|---|---|---|---|---|---|
| 0 | AE | 1 | 0 | 0 | 0 | 0 |
| 1 | CE | 0 | 1 | 0 | 0 | 0 |
| 2 | DOM | 0 | 0 | 1 | 0 | 0 |
| 3 | JC | 0 | 0 | 0 | 1 | 0 |
| 4 | PEP | 0 | 0 | 0 | 0 | 1 |
# Train: 2021-10-03 00:00:00 to 2021-12-31 23:00:00
pytorch_train = pytorch_df[pytorch_df.ds.between(pd.to_datetime(test_cutoff) - pd.Timedelta(days = 90), test_cutoff, inclusive = 'left')]
# Test: 2022-01-01 00:00:00 to 2022-01-30 23:00:00
pytorch_test = pytorch_df[pytorch_df.ds >= test_cutoff].reset_index(drop = True)
# Find size of input and horizon. Will be same across zones.
input_size_train = int(len(pytorch_train[pytorch_train.unique_id == 'AE']) / 2.75) # 2.75 to give a few periods to train.
h_size_test = len(pytorch_test[pytorch_test.unique_id == 'AE'])
print(input_size_train)
print(h_size_test)
785 720
Developed as an improvement over the popular N-BEATS deep-learning model. The model is composed of several multilayer perceptrons (MLPs) with ReLU nonlinearities. Put simply, N-HiTS interpolates the output across blocks which each specializes in separate signal frequencies.
The paper on N-HiTS was released in November, 2022 and is worth looking through (https://arxiv.org/abs/2201.12886).
As with other models in this project, we are limited by the hardware this is run on so this is by no means a full test on this dataset. The training input was reduced to 90 days and forecast limited to 1 month (30*24). Exogenous features were slowing down training too much to be included and were causing the model to stay in a loss saddle.
Note:
Unable to resolve error that causes the kernel to crash when running N-HiTS when importing both statsforecast.models import MSTL, AutoARIMA, HoltWinters AND neuralforecast.models import NHITS, TFT.
For now, training model, saving prediction dataframes to .csv which can then be loaded after commenting out model training code block.
# # Set environment variable since pytorch still lacks some functionality on apple silicon.
# os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
# nhits_y_pred = {}
# for horizon in horizon_selector:
# if horizon in ['6M', '2Y']:
# continue # skipping these horizons for this model.
# print(f'Currently Training: {horizon}')
# nhits_y_pred[horizon] = []
# horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
# models = [NHITS(input_size = input_size_train,
# h = horizon_freq,
# max_steps = 1000,
# learning_rate = 0.001,
# n_pool_kernel_size = [2, 2, 1],
# n_freq_downsample = [4, 2, 1],
# mlp_units = 3 * [[512, 512]],
# accelerator = 'cpu',
# stat_exog_list = ['AE', 'CE', 'DOM', 'JC', 'PEP']),
# ]
# nhits_model = NeuralForecast(models = models, freq = 'H')
# nhits_model.fit(df = pytorch_train, static_df = pytorch_df_static)
# # Predict.
# nhits_y_pred[horizon] = nhits_model.predict(futr_df = pytorch_test.iloc[:, 1:4]).reset_index()
# # Save model.
# nhits_model.save(path=current_wdir + f'/Models/NHITS/{horizon}',
# model_index=None,
# overwrite=True,
# save_dataset=True)
Comment out save as needed.
# # Save predictions to .csv files.
# # See note at section header.
# for horizon in nhits_y_pred.keys():
# nhits_y_pred[horizon].to_csv(current_wdir + f'/Models/NHITS/{horizon}/{horizon}_y_pred.csv.gz', compression = 'gzip', index = False)
Load previously trained model predictions and reformat to align with other models.
# Load predictions from last model fit.
nhits_y_pred = {}
for horizon in horizon_selector:
if horizon in ['6M', '2Y']:
continue # Not in this model.
file = current_wdir + f'/Models/NHITS/{horizon}/{horizon}_y_pred.csv.gz'
nhits_y_pred[horizon] = []
nhits_y_pred[horizon] = pd.read_csv(file, compression = 'gzip')
# Reformat into y_pred dict to match other models.
nhits_y_pred_rfmt = {}
for zone in zones:
nhits_y_pred_rfmt[zone] = {}
for horizon in horizon_selector:
nhits_y_pred_rfmt[zone][horizon] = []
# Set missing horizon results to N/A so it shows up in results.
if horizon not in nhits_y_pred.keys():
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
nhits_y_pred_rfmt[zone][horizon] = np.full((horizon_freq),'N/A')
continue
nhits_y_pred_rfmt[zone][horizon] = nhits_y_pred[horizon].query(
'unique_id == @zone')['NHITS'].to_numpy()
As the name implies, it is based on a transformer multi-headed self-attention architecture that temporal data is fed into. It couples the transformer decoder with inputs of LSTM encoder/decoders for the static covariates. This powerful model is a great example of modern hybrid models using different architectures that specialize to effectively encode and analyze different sections of the feature space.
However, as the theme continues, my non-parallelized machine struggled to keep up with the demands of the TFT architecture. Though the results were promising, I was able to forecast up to 1 week from just ~40 days of training data and removing all exogenous covariates, but it still took 500+ minutes of training. For now this model will be deactivated until this is revisited in the future.
# # Set environment variable since pytorch lacks some functionality on apple silicon still.
# os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
# tft_model = NeuralForecast(
# models = [TFT(h = h_size_test,
# input_size = input_size_train, # AR input size (t-input_size : t)
# hidden_size = 20,
# loss = DistributionLoss(distribution = 'StudentT', level = [80, 90]),
# learning_rate = 0.005,
# stat_exog_list = ['AE', 'CE', 'DOM', 'JC', 'PEP'],
# futr_exog_list = ['MaxTemperature', 'MinTemperature', 'AvgTemperature',
# 'Precipitation', 'Hour', 'Day',
# 'Month', 'Year', 'Day_of_Year',
# 'Rolling_Mean_3', 'Rolling_Mean_6', 'Lag_Hour',
# 'Lag_6_Hours', 'Lag_Day', 'Lag_Week',
# 'Lag_Month', 'Deriv_1', 'Deriv_2',
# 'FFT', 'FFT_Freq'],
# hist_exog_list = ['time_idx'],
# max_steps = 500,
# val_check_steps = 10,
# early_stop_patience_steps = 10,
# scaler_type = 'robust',
# windows_batch_size = None,
# enable_progress_bar = True,
# accelerator = 'cpu', # set to cpu for now, mps not working..
# # start_padding_enabled = True
# ),
# ],
# freq = 'H' # Hourly
# )
# tft_model.fit(df = pytorch_train, static_df = pytorch_df_static, val_size = h_size_test)
# tft_y_pred = tft_model.predict(futr_df = pytorch_test)
# # Plot predictions
# tft_y_pred = tft_y_pred.reset_index(drop = False).drop(columns = ['unique_id', 'ds'])
# for zone in zones:
# plt.subplots(figsize = (10, 4))
# plot_df = pd.concat([pytorch_test, tft_y_pred], axis = 1)
# plot_df = pd.concat([pytorch_train, plot_df])
# plot_df = plot_df[plot_df.unique_id == zone].drop('unique_id', axis = 1)
# plt.plot(plot_df['ds'], plot_df['y'], c = zones_palette[zone], label = 'True')
# plt.plot(plot_df['ds'], plot_df['TFT'], c = 'purple', label = 'Forecast')
# plt.fill_between(x = plot_df['ds'][-h_size_test: ],
# y1 = plot_df['TFT-lo-90'][-h_size_test: ].values,
# y2 = plot_df['TFT-hi-90'][-h_size_test: ].values,
# alpha = 0.2, color = '#82ccfa', label = 'CI 90')
# plt.title(zone)
# plt.legend()
# plt.grid()
# plt.plot()
# # Save model.
# pytorch_test.to_csv('TFT_Test.csv')
# tft_y_pred.to_csv('TFT_Predict.csv')
# tft_model.save(path=current_wdir + '/Models/TFT/nf2',
# model_index=None,
# overwrite=False,
# save_dataset=True)
# Global for now.
model_selector = {'Naive' : naive,
'Mean' : mean,
'HoltWinters' : holtwinters_y_pred,
'MSTL' : mstl_y_pred_rfmt,
'SVM' : svm_y_pred,
'XGBoost' : xgb_y_pred,
'NHITS' : nhits_y_pred_rfmt
#'TFT' : tft_y_pred
}
reorderdict_horizon = {'6H':0,'3D':1,'1W':2,'1M':3,'6M':4,'2Y':5}
reorderdict_model = {'Naive':0,'Mean':1,'HoltWinters':2,'MSTL':3,'SVM':4,'XGBoost':5,'NHITS':6}
Symmetric Mean Absolute Percentage Error (sMAPE) will be the evaluation metric used to compare all models. This will ensure comparisons of error across different models and forecasting horizons are normalized.
$\text{sMAPE} =\frac{200}{n}\sum_{i=1}^{n}\frac{\lvert y_i - \hat{y_i} \rvert}{\lvert y_i \rvert + \lvert \hat{y_i} \rvert}$
where $y_i$ and $\hat{y_i}$ are the actual values and forecasted values respectively, and $n$ is the number of points in the set.
One possible limitation of sMAPE is if a predicted value or actual value equals zero, the metric balloons to the maximum error value, and if both equals zero, the metric is undefined. Chosen evaluation metrics often have strengths and weaknesses, and it is yet unclear if this limitation will occur here.
def sMAPE(model, forecast_horizon):
sMAPE_result = {}
for zone in zones:
sMAPE_result[zone] = []
y_pred = model_selector[model][zone][forecast_horizon]
# Check model for missing horizon.
if y_pred[0] == 'N/A':
sMAPE_result[zone] = np.nan
continue
y_test_horizon = generate_test_horizon(zone, forecast_horizon, 'y')
sMAPE_result[zone] = (200 / len(y_pred)) * np.sum(np.abs(y_test_horizon-y_pred) / (np.abs(y_test_horizon)+np.abs(y_pred)))
return sMAPE_result
Root Mean Squared Error (RMSE) will also be used in model evaluation to complement sMAPE. The value of including RMSE is that it is an easily interpretable quantitative metric resulting in a value that is in the same unit as the response. It also heavily penalizes predictions the further they are from the actual value.
${RMSE} =\sqrt\frac{\sum_{i=1}^{n}(y_i - \hat{y_i})^2}{n}$
where $y_i$ and $\hat{y_i}$ are the actual values and forecasted values respectively, and $n$ is the number of points in the set.
def RMSE(model, forecast_horizon):
RMSE_result = {}
for zone in zones:
RMSE_result[zone] = []
y_pred = model_selector[model][zone][forecast_horizon]
y_test_horizon = generate_test_horizon(zone, forecast_horizon, 'y')
RMSE_result[zone] = np.sqrt(sum((y_test_horizon - y_pred)**2)/len(y_test_horizon))
return RMSE_result
In conjunction with the RMSE, a Normalized RMSE (NRMSE) will also be used to compare models between zones, since each zone have varying scales of the target variable.
${NRMSE} = \frac{RMSE}{\bar{Y}}$
where $\bar{Y}$ is the mean of the actual values.
def NormRMSE(model, forecast_horizon):
NormRMSE_result = {}
for zone in zones:
NormRMSE_result[zone] = []
y_pred = model_selector[model][zone][forecast_horizon]
# Check model for missing horizon.
if y_pred[0] == 'N/A':
NormRMSE_result[zone] = np.nan
continue
y_test_horizon = generate_test_horizon(zone, forecast_horizon, 'y')
NormRMSE_result[zone] = np.sqrt(sum((y_test_horizon - y_pred)**2)/len(y_test_horizon)) / np.mean(y_test_horizon)
return NormRMSE_result
# Create NRMSE dataframe.
results_NormRMSE = pd.DataFrame()
for model in model_selector:
#print(model)
for horizon in horizon_selector:
# if (model == 'MSTL' or model == 'NHITS') and horizon in ['6M', '2Y']:
# continue # Skip, too long for this model.
#print(horizon)
df_index = pd.MultiIndex.from_product([['NormRMSE'],[model],[horizon]], names = ['Metric', 'Model', 'Horizon'])
results_NormRMSE = pd.concat([results_NormRMSE, pd.DataFrame(NormRMSE(model, horizon), index = df_index)])
results_NormRMSE = results_NormRMSE.sort_index(level='Model', key = lambda x: x.map(reorderdict_model))
#results_NormRMSE.query('Horizon == @horizon').sort_index(level = 'Model')
results_NormRMSE_dict = {}
for horizon in horizon_selector:
results_NormRMSE_dict[horizon] = []
results_NormRMSE_dict[horizon] = results_NormRMSE.query('Horizon == @horizon').style.highlight_min(axis = 0, props = 'background-color:green;', subset = zones)
#display(results_NormRMSE_dict[horizon])
# Create sMAPE dataframe.
results_sMAPE = pd.DataFrame()
for model in model_selector:
#print(model)
for horizon in horizon_selector:
# if (model == 'MSTL' or model == 'NHITS') and horizon in ['6M', '2Y']:
# continue # Skip, too long for this model.
#print(horizon)
df_index = pd.MultiIndex.from_product([['sMAPE'],[model],[horizon]], names = ['Metric', 'Model', 'Horizon'])
results_sMAPE = pd.concat([results_sMAPE, pd.DataFrame(sMAPE(model, horizon), index = df_index)])
results_sMAPE = results_sMAPE.sort_index(level='Model', key = lambda x: x.map(reorderdict_model))
results_sMAPE_dict = {}
for horizon in horizon_selector:
results_sMAPE_dict[horizon] = []
results_sMAPE_dict[horizon] = results_sMAPE.query('Horizon == @horizon').style.highlight_min(axis = 0, props = 'background-color:green;', subset = zones)
#display(results_sMAPE_dict[horizon])
#results_final = {}
for horizon in horizon_selector:
print(f"-------------------------------------- {horizon} --------------------------------------")
#results_final[horizon] = []
#results_final[horizon] = pd.concat([results_NormRMSE_dict[horizon], results_sMAPE_dict[horizon]], axis = 0)
display(results_NormRMSE_dict[horizon])
display(results_sMAPE_dict[horizon])
-------------------------------------- 6H --------------------------------------
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| NormRMSE | Naive | 6H | 0.136277 | 0.087492 | 0.150645 | 0.184597 | 0.235583 |
| Mean | 6H | 0.340522 | 0.314349 | 0.186621 | 0.397084 | 0.488823 | |
| HoltWinters | 6H | 0.014868 | 0.046767 | 0.011711 | 0.020384 | 0.013049 | |
| MSTL | 6H | 0.010456 | 0.042240 | 0.020137 | 0.026714 | 0.025732 | |
| SVM | 6H | 0.054108 | 0.104567 | 0.076867 | 0.085008 | 0.190672 | |
| XGBoost | 6H | 0.045109 | 0.066064 | 0.041571 | 0.035517 | 0.028640 | |
| NHITS | 6H | 0.028110 | 0.045254 | 0.011302 | 0.017790 | 0.016293 |
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| sMAPE | Naive | 6H | 12.580831 | 8.368642 | 13.676927 | 16.850913 | 20.837337 |
| Mean | 6H | 28.959313 | 27.093643 | 16.941791 | 33.020723 | 39.234910 | |
| HoltWinters | 6H | 1.281159 | 4.399481 | 1.074516 | 1.976006 | 1.146197 | |
| MSTL | 6H | 0.924006 | 4.236191 | 1.540820 | 2.242525 | 1.879716 | |
| SVM | 6H | 4.607263 | 9.269047 | 6.653134 | 7.561048 | 18.970828 | |
| XGBoost | 6H | 3.375682 | 6.152227 | 4.115225 | 3.328029 | 2.278010 | |
| NHITS | 6H | 2.248364 | 4.536223 | 1.046940 | 1.470468 | 1.328477 |
-------------------------------------- 3D --------------------------------------
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| NormRMSE | Naive | 3D | 0.119357 | 0.109039 | 0.171870 | 0.113405 | 0.200999 |
| Mean | 3D | 0.208615 | 0.141610 | 0.183489 | 0.219810 | 0.280960 | |
| HoltWinters | 3D | 0.096301 | 0.104986 | 0.128598 | 0.084460 | 0.129481 | |
| MSTL | 3D | 0.132737 | 0.077898 | 0.111262 | 0.133788 | 0.131693 | |
| SVM | 3D | 0.088246 | 0.058785 | 0.094406 | 0.131786 | 0.118316 | |
| XGBoost | 3D | 0.098207 | 0.032681 | 0.082099 | 0.064133 | 0.050121 | |
| NHITS | 3D | 0.120104 | 0.076831 | 0.141640 | 0.110050 | 0.147622 |
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| sMAPE | Naive | 3D | 10.564294 | 8.141624 | 14.682704 | 10.053777 | 17.745348 |
| Mean | 3D | 17.931932 | 11.818520 | 13.749959 | 18.790199 | 24.905756 | |
| HoltWinters | 3D | 6.795567 | 9.117568 | 7.209524 | 6.459062 | 7.638461 | |
| MSTL | 3D | 10.137742 | 6.672670 | 7.716595 | 11.919681 | 9.536786 | |
| SVM | 3D | 6.649310 | 5.024464 | 7.196546 | 9.249314 | 11.662376 | |
| XGBoost | 3D | 7.755542 | 2.698701 | 5.322590 | 4.371418 | 4.240230 | |
| NHITS | 3D | 9.395042 | 7.082582 | 10.380827 | 9.056158 | 12.270452 |
-------------------------------------- 1W --------------------------------------
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| NormRMSE | Naive | 1W | 0.114612 | 0.117962 | 0.137086 | 0.092925 | 0.143650 |
| Mean | 1W | 0.152446 | 0.126032 | 0.221700 | 0.163112 | 0.190239 | |
| HoltWinters | 1W | 0.129742 | 0.162546 | 0.208207 | 0.120589 | 0.212457 | |
| MSTL | 1W | 0.124733 | 0.131192 | 0.119044 | 0.126353 | 0.148563 | |
| SVM | 1W | 0.070804 | 0.055620 | 0.089589 | 0.095575 | 0.089091 | |
| XGBoost | 1W | 0.072556 | 0.045279 | 0.069561 | 0.053372 | 0.052927 | |
| NHITS | 1W | 0.114543 | 0.159284 | 0.171549 | 0.104196 | 0.172830 |
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| sMAPE | Naive | 1W | 9.833909 | 10.046109 | 11.853713 | 7.963604 | 12.959489 |
| Mean | 1W | 13.027409 | 11.028017 | 19.665086 | 13.869585 | 16.964675 | |
| HoltWinters | 1W | 10.738640 | 15.046106 | 17.100893 | 10.238545 | 18.103634 | |
| MSTL | 1W | 10.071614 | 10.309518 | 9.181221 | 10.920468 | 12.810960 | |
| SVM | 1W | 5.717646 | 4.692355 | 6.849986 | 7.107230 | 8.467719 | |
| XGBoost | 1W | 5.374305 | 3.589565 | 5.300548 | 3.915818 | 4.300709 | |
| NHITS | 1W | 9.455587 | 15.526011 | 13.352938 | 9.039466 | 14.380739 |
-------------------------------------- 1M --------------------------------------
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| NormRMSE | Naive | 1M | 0.121239 | 0.100637 | 0.161737 | 0.100371 | 0.148098 |
| Mean | 1M | 0.126805 | 0.101653 | 0.283969 | 0.133212 | 0.162745 | |
| HoltWinters | 1M | 0.167221 | 0.167445 | 0.304187 | 0.176096 | 0.286739 | |
| MSTL | 1M | 0.149286 | 0.101509 | 0.187906 | 0.226123 | 0.189383 | |
| SVM | 1M | 0.084019 | 0.057702 | 0.087019 | 0.084813 | 0.090183 | |
| XGBoost | 1M | 0.067560 | 0.040742 | 0.089747 | 0.068048 | 0.066061 | |
| NHITS | 1M | 0.159065 | 0.212156 | 0.331945 | 0.136681 | 0.299784 |
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| sMAPE | Naive | 1M | 10.163846 | 7.986211 | 13.466885 | 8.498115 | 12.421239 |
| Mean | 1M | 10.515516 | 8.513869 | 28.365910 | 10.961452 | 14.056335 | |
| HoltWinters | 1M | 15.198061 | 16.639998 | 30.038587 | 16.547223 | 28.951588 | |
| MSTL | 1M | 13.202894 | 8.138927 | 16.189155 | 22.238330 | 17.300937 | |
| SVM | 1M | 6.603718 | 4.732827 | 7.115081 | 6.492116 | 7.793409 | |
| XGBoost | 1M | 5.180421 | 3.061406 | 7.495047 | 5.063363 | 5.188263 | |
| NHITS | 1M | 13.960246 | 21.936399 | 31.881035 | 12.069495 | 29.509419 |
-------------------------------------- 6M --------------------------------------
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| NormRMSE | Naive | 6M | 0.195920 | 0.194636 | 0.173055 | 0.207272 | 0.188107 |
| Mean | 6M | 0.240628 | 0.199582 | 0.218384 | 0.231947 | 0.218194 | |
| HoltWinters | 6M | 0.205409 | 0.176356 | 0.221137 | 0.195561 | 0.216097 | |
| MSTL | 6M | nan | nan | nan | nan | nan | |
| SVM | 6M | 0.130074 | 0.093505 | 0.094867 | 0.137879 | 0.129636 | |
| XGBoost | 6M | 0.081284 | 0.054166 | 0.071832 | 0.114838 | 0.060866 | |
| NHITS | 6M | nan | nan | nan | nan | nan |
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| sMAPE | Naive | 6M | 13.572251 | 12.176701 | 13.512928 | 12.933855 | 14.080367 |
| Mean | 6M | 18.941483 | 15.168616 | 16.169793 | 18.350397 | 18.375323 | |
| HoltWinters | 6M | 15.425375 | 12.174435 | 15.976172 | 14.355150 | 15.959039 | |
| MSTL | 6M | nan | nan | nan | nan | nan | |
| SVM | 6M | 9.653313 | 6.016505 | 7.315464 | 9.054637 | 11.440345 | |
| XGBoost | 6M | 6.139266 | 3.743250 | 5.560432 | 6.804394 | 4.516422 | |
| NHITS | 6M | nan | nan | nan | nan | nan |
-------------------------------------- 2Y --------------------------------------
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| NormRMSE | Naive | 2Y | 0.193376 | 0.183827 | 0.169782 | 0.209542 | 0.174865 |
| Mean | 2Y | 0.296964 | 0.207625 | 0.233131 | 0.282705 | 0.225961 | |
| HoltWinters | 2Y | 0.283207 | 0.171086 | 0.231524 | 0.254468 | 0.214478 | |
| MSTL | 2Y | nan | nan | nan | nan | nan | |
| SVM | 2Y | 0.167635 | 0.097154 | 0.102399 | 0.157525 | 0.278998 | |
| XGBoost | 2Y | 0.072943 | 0.058035 | 0.078654 | 0.105739 | 0.054516 | |
| NHITS | 2Y | nan | nan | nan | nan | nan |
| AE | CE | DOM | JC | PEP | |||
|---|---|---|---|---|---|---|---|
| Metric | Model | Horizon | |||||
| sMAPE | Naive | 2Y | 13.257603 | 12.475873 | 14.042065 | 13.456625 | 12.813002 |
| Mean | 2Y | 21.589680 | 16.402006 | 17.586856 | 20.421968 | 18.642190 | |
| HoltWinters | 2Y | 18.524866 | 11.674018 | 18.222272 | 16.527110 | 15.562225 | |
| MSTL | 2Y | nan | nan | nan | nan | nan | |
| SVM | 2Y | 12.905069 | 6.720007 | 8.251229 | 10.808048 | 28.784374 | |
| XGBoost | 2Y | 5.588003 | 4.122110 | 6.700303 | 6.550198 | 4.082743 | |
| NHITS | 2Y | nan | nan | nan | nan | nan |
fig, ax = plt.subplots(figsize = (6, 10))
sns.heatmap(results_NormRMSE, ax = ax, annot = True, fmt = '.3f')
plt.show()
fig, ax = plt.subplots(figsize = (6, 10))
sns.heatmap(results_sMAPE, ax = ax, annot = True, fmt = '.3f')
plt.show()
models_palette = {'Naive' : '#89a832',
'Mean' : 'black',
'HoltWinters' : '#3fd1ca',
'MSTL' : '#d1c000',
'XGBoost' : '#432075',
'SVM' : '#d4359f',
'NHITS' : '#cc6a31',
}
model_selector.keys()
dict_keys(['Naive', 'Mean', 'HoltWinters', 'MSTL', 'SVM', 'XGBoost', 'NHITS'])
for zone in zones:
plt.subplots(figsize=(5,4))
for model in model_selector:
plot_df = results_NormRMSE[zone].to_frame().query('Model == @model').sort_index(level='Horizon', key = lambda x: x.map(reorderdict_horizon))
sns.lineplot(x = reorderdict_horizon.keys(), y = plot_df[zone].values, label = model, color = models_palette[model])
plt.title(f'{zone}')
plt.xlabel('Forecast Horizon')
plt.ylabel('Normalized RMSE')
#TODO: Clean this section up, too convoluted.
# Plot all zones and horizon predictions.
for zone in zones:
# Two subplots for different horizon lengths.
fig, ax = plt.subplots(3, 1, figsize = (11, 4), sharex = True)
fig2, ax2 = plt.subplots(3, 1, figsize = (11, 5), sharex = False)
for i, horizon in enumerate(horizon_selector):
horizon_freq = int((horizon_selector[horizon].total_seconds() / 3600) + 1)
if i < 3: # Separate shorter forecasts from longer ones.
for j, model in enumerate(model_selector):
plot_df = pd.concat([y_test[zone].iloc[:horizon_freq],
pd.Series(model_selector[model][zone][horizon],
index = y_test[zone].iloc[:horizon_freq].index,
name = 'y_hat')],axis = 1)
ax[i].plot(plot_df.index, plot_df['y_hat'], c = models_palette[model], alpha = 0.6, label = f'{model} Forecast' if i == 0 else '')
if j == len(model_selector)-1: # Only need to plot true once. Last so it plots on top.
ax[i].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.7, ls = (0, (3, 1, 1, 1, 1, 1)), label = 'True' if i == 2 else '')
ax[i].grid()
ax[i].set_title(horizon)
else: # Longer forecast horizons.
for j, model in enumerate(model_selector):
if (model == 'MSTL' or model == 'NHITS') and horizon in ['6M', '2Y']:
if j == len(model_selector)-1: # Need this here so it plots true in this if case.
ax2[i-3].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.7, ls = (0, (3, 1, 1, 1, 1, 1)))#, label = 'True' if i-3 == 0 else '')
continue # Skip, too long for this model.
plot_df = pd.concat([y_test[zone].iloc[:horizon_freq],
pd.Series(model_selector[model][zone][horizon],
index = y_test[zone].iloc[:horizon_freq].index,
name = 'y_hat')],axis = 1)
ax2[i-3].plot(plot_df.index, plot_df['y_hat'], c = models_palette[model], alpha = 0.6, label = f'{model} Forecast' if i-3 == 0 else '')
if j == len(model_selector)-1: # Only need to plot true once. Last so it plots on top.
ax2[i-3].plot(plot_df.index, plot_df['mw'], c = zones_palette[zone], alpha = 0.7, ls = (0, (3, 1, 1, 1, 1, 1)), label = 'True' if i == 3 else '')
ax2[i-3].grid()
ax2[i-3].set_title(horizon)
fig.suptitle(f"{zone} - Model Forecasts at each Horizon")
fig.supylabel('Megawatts (MW)')
fig.legend(loc = 'upper right', ncol = 2, fontsize = 8.2)
fig.tight_layout()
fig2.suptitle(f"{zone} - Model Forecasts at each Horizon")
fig2.supylabel('Megawatts (MW)')
fig2.legend(loc = 'upper right', ncol = 2, fontsize = 8.2)
fig2.tight_layout()
plt.show()
Several resources that helped along the way in no particular order.
Keeping all the energy generation code here until I decide how it fits into this project. Might spin off into its own depending on timing.
data_folder_gen = current_wdir + '/Data/Gen_by_Fuel'
file_path_gen = [f'{data_folder_gen}/{file}' for file in os.listdir(data_folder_gen) if '.csv' in file]
file_path_gen = sorted(file_path_gen)
gen_df = pd.concat([pd.read_csv(file, compression = 'gzip') for file in file_path_gen], join = 'outer', ignore_index = False, axis = 0)
display(gen_df)
display(gen_df.dtypes)
| datetime_beginning_utc | datetime_beginning_ept | fuel_type | mw | fuel_percentage_of_total | is_renewable | |
|---|---|---|---|---|---|---|
| 0 | 1/1/2017 4:00:00 AM | 12/31/2016 11:00:00 PM | Coal | 34820 | 0.41 | False |
| 1 | 1/1/2017 4:00:00 AM | 12/31/2016 11:00:00 PM | Gas | 11169 | 0.13 | False |
| 2 | 1/1/2017 4:00:00 AM | 12/31/2016 11:00:00 PM | Hydro | 699 | 0.01 | True |
| 3 | 1/1/2017 4:00:00 AM | 12/31/2016 11:00:00 PM | Multiple Fuels | 266 | 0.00 | False |
| 4 | 1/1/2017 4:00:00 AM | 12/31/2016 11:00:00 PM | Nuclear | 34269 | 0.41 | False |
| ... | ... | ... | ... | ... | ... | ... |
| 21095 | 1/1/2024 5:00:00 AM | 1/1/2024 12:00:00 AM | Oil | 221 | 0.00 | False |
| 21096 | 1/1/2024 5:00:00 AM | 1/1/2024 12:00:00 AM | Other Renewables | 690 | 0.01 | True |
| 21097 | 1/1/2024 5:00:00 AM | 1/1/2024 12:00:00 AM | Solar | 12 | 0.00 | True |
| 21098 | 1/1/2024 5:00:00 AM | 1/1/2024 12:00:00 AM | Storage | 0 | 0.00 | False |
| 21099 | 1/1/2024 5:00:00 AM | 1/1/2024 12:00:00 AM | Wind | 4004 | 0.04 | True |
783663 rows × 6 columns
datetime_beginning_utc object datetime_beginning_ept object fuel_type object mw int64 fuel_percentage_of_total float64 is_renewable bool dtype: object
gen_df[date_cols] = gen_df[date_cols].apply(pd.to_datetime, format = '%m/%d/%Y %I:%M:%S %p', utc = False)
gen_fuel_df = gen_df[['datetime_beginning_ept', 'fuel_type', 'mw']].groupby(['datetime_beginning_ept', 'fuel_type'], as_index = False).sum()
palette = ['#570408', '#005d5d', '#1192e8', '#9f1853', '#ee538b', '#012749', '#009d9a', '#002d9c', '#fa4d56', '#198038', '#b28600', '#6929c4']
ax, fig = plt.subplots(figsize = (14, 4))
ax = sns.lineplot(gen_fuel_df, x = 'datetime_beginning_ept', y = 'mw', hue = 'fuel_type', alpha = 0.7, lw = 1, palette = sns.color_palette(palette, 12))
ax.margins(x = 0)
ax.set_xlabel('Time')
ax.set_ylabel('Megawatts (MW)')
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles = handles, labels = labels)
plt.legend(loc = 'upper left')
plt.show()
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Coal'], x = 'datetime_beginning_ept', y = 'mw', lw = 1)
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Gas'], x = 'datetime_beginning_ept', y = 'mw', lw = 1)
ax, fig = plt.subplots(figsize = (14, 8))
ax = sns.lineplot(gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Solar'], x = 'datetime_beginning_ept', y = 'mw', lw = 1)
# df = pd.DataFrame({'ds' : gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Gas']['datetime_beginning_ept'],
# 'y' : gen_fuel_df.loc[gen_fuel_df['fuel_type'] == 'Gas']['mw']})
# m = Prophet()
# m.fit(df)
# future = m.make_future_dataframe(periods=365*2)
# display(future.tail())
# forecast = m.predict(future)
# display(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
# fig1 = m.plot(forecast)
# ax = fig1.gca()
# plt.axvline(max(df['ds']), color = 'purple', ls = '--')
# # Sourced from the prophet repository and repurposed to include:
# # - Delineate observed data and forecasted data (dashed line)
# # - Separate observed data and forecasted data for different visuals
# # - Color changes
# # - Zoom into the last few years to better visualize forecast (currently hardcoded)
# def prophetplot(m, fcst, ax=None, uncertainty=True, plot_cap=True, xlabel='Time', ylabel='Megawatts (MW)',
# figsize=(10, 6), include_legend=False):
# """Plot the Prophet forecast.
# Parameters
# ----------
# m: Prophet model.
# fcst: pd.DataFrame output of m.predict.
# ax: Optional matplotlib axes on which to plot.
# uncertainty: Optional boolean to plot uncertainty intervals, which will
# only be done if m.uncertainty_samples > 0.
# plot_cap: Optional boolean indicating if the capacity should be shown
# in the figure, if available.
# xlabel: Optional label name on X-axis
# ylabel: Optional label name on Y-axis
# figsize: Optional tuple width, height in inches.
# include_legend: Optional boolean to add legend to the plot.
# Returns
# -------
# A matplotlib figure.
# """
# user_provided_ax = False if ax is None else True
# if ax is None:
# fig = plt.figure(facecolor='w', figsize=figsize)
# ax = fig.add_subplot(111)
# else:
# fig = ax.get_figure()
# fcst_t = fcst['ds']#.dt.to_pydatetime()
# # Initialize the observation and forecast data variables
# # Finds and separates the two sets by using the max date in (observed) input data.
# obs_t = fcst_t.loc[fcst_t < max(df['ds'])]
# obs_yhat = fcst['yhat'].iloc[ : fcst['ds'].loc[fcst['ds'] <= max(df['ds'])].index[-1]]
# obs_lower = fcst['yhat_lower'].iloc[ : fcst['ds'].loc[fcst['ds'] <= max(df['ds'])].index[-1]]
# obs_upper = fcst['yhat_upper'].iloc[ : fcst['ds'].loc[fcst['ds'] <= max(df['ds'])].index[-1]]
# fc_t = fcst_t.loc[fcst_t >= max(df['ds'])]
# fc_yhat = fcst['yhat'].iloc[fcst['ds'].loc[fcst['ds'] >= max(df['ds'])].index[0]:]
# fc_lower = fcst['yhat_lower'].iloc[fcst['ds'].loc[fcst['ds'] >= max(df['ds'])].index[0]:]
# fc_upper = fcst['yhat_upper'].iloc[fcst['ds'].loc[fcst['ds'] >= max(df['ds'])].index[0]:]
# ax.plot(m.history['ds'].dt.to_pydatetime(), m.history['y'], '.', c = '#012749', markersize = 1, alpha = 0.9,
# label='Observed data points')
# ax.plot(obs_t, obs_yhat, ls='-', c='#005d5d', label='Forecast', lw = 0.75, alpha = 0.9)
# ax.plot(fc_t, fc_yhat, ls='-', c='#382238', label='Predict', lw = 0.75, alpha = 1)
# if 'cap' in fcst and plot_cap:
# ax.plot(fcst_t, fcst['cap'], ls='--', c='k', label='Maximum capacity')
# if m.logistic_floor and 'floor' in fcst and plot_cap:
# ax.plot(fcst_t, fcst['floor'], ls='--', c='k', label='Minimum capacity')
# if uncertainty and m.uncertainty_samples:
# ax.fill_between(obs_t, obs_lower, obs_upper,
# color='#009d9a', alpha=0.4, label='Uncertainty interval')
# ax.fill_between(fc_t, fc_lower, fc_upper,
# color='#753f75', alpha=0.4, label='Uncertainty interval')
# # Specify formatting to workaround matplotlib issue #12925
# locator = AutoDateLocator(interval_multiples=False)
# formatter = AutoDateFormatter(locator)
# ax.xaxis.set_major_locator(locator)
# ax.xaxis.set_major_formatter(formatter)
# ax.grid(True, which='major', c='gray', ls='-', lw=0.75, alpha=0.2)
# ax.set_xlabel(xlabel)
# ax.set_ylabel(ylabel)
# # Zoom in x-axis closer to prediction.
# ax.set_xlim(pd.to_datetime(['2020-01-01 00:00:00', '2026-01-01 00:00:00']))
# ax.set_ylim(0, 90000)
# # Prediction divider line.
# plt.axvline(max(df['ds']), color = 'black', ls = '--', lw = 1.2)
# # Linear trend line to calculate yearly increase.
# obs_poly_fn = np.poly1d(np.polyfit(range(len(obs_t)), obs_yhat, 1))
# plt.plot(obs_t, obs_poly_fn(range(len(obs_t))), c = '#913529')
# fc_poly_fn = np.poly1d(np.polyfit(range(len(fc_t)), fc_yhat, 1))
# plt.plot(fc_t, fc_poly_fn(range(len(fc_t))), c = '#252f9c')
# if include_legend:
# ax.legend()
# plt.legend(loc = 'upper left')
# if not user_provided_ax:
# fig.tight_layout()
# return fig
# prophetplot(m, forecast)
# plt.show()